Preface



This volume contains a selection of papers presented during the 
biennial meeting of the Classification and Data Analysis Group 
(CLADAG) of the Società Italiana di Statistica, which was organized
by the Istituto di Statistica of the Università degli Studi di
Palermo and held in the Palazzo Steri in Palermo on July 5-6, 
2001. For this conference, and after checking the submitted 4- 
page abstracts, 54 papers were admitted for presentation. They 
covered a large range of topics from multivariate data analysis, 
with special emphasis on classification and clustering, computa- 
tional statistics, time series analysis, and applications in various 
classical or recent domains. A two-fold careful reviewing process 
led to the selection of 22 papers which are presented in this vol- 
ume. They convey either a new idea or methodology, present a 
new algorithm, or concern an interesting application. 

We have clustered these papers into five groups as follows: 

1. Classification Methods with Applications 

2. Time Series Analysis and Related Methods 

3. Computer Intensive Techniques and Algorithms 

4. Classification and Data Analysis in Economics 

5. Multivariate Analysis in Applied Sciences. 

In each section the papers are arranged in alphabetical order. 

The editors - two of them the organizers of the CLADAG confer- 
ence - would like to express their gratitude to the authors whose 
enthusiastic participation made the meeting possible and very 
successful. We also want to extend our thanks to the interna- 
tional group of reviewers for their diligence and the time spent 
in their professional refereeing work. Moreover, we are grateful to 
the chairpersons and discussants of the 13 sessions of the conference;
their comments provided useful suggestions for the authors
and the audience.
Special thanks are due to the Local Organizing Committee
from the University of Palermo that comprised 

Salvatore Bologna, Angelo M. Mineo, Antonella Plaia 

together with Antonino Mineo (coordinator) and Marcello Chiodi 
who have co-edited this volume. 

Finally, we would like to thank Dr. Martina Bihn from Springer 
Verlag for the excellent cooperation in publishing this volume. 

Palermo and Aachen Hans-Hermann Bock 

Marcello Chiodi 
Antonino Mineo 
Part I



Classification Methods with 
Applications 




The STP Procedure as Overfitting 
Avoidance Tool in Classification Trees 



Carmela Cappelli 1 and Francesco Mola 2 

1 Dipartimento di Scienze Statistiche, Facoltà di Scienze Politiche, Università
degli Studi di Napoli, Via L. Rodino, I-80138 Napoli, Italy

2 Dipartimento di Economia, Facoltà di Economia, Università degli Studi di
Cagliari, Viale S. Ignazio n.17, I-90123 Cagliari, Italy



Abstract. Among tree classification methods, it will be shown how the STP algorithm
(Cappelli et al, 2002) is a useful tool for overfitting avoidance. The overfitting
problem is the presence of "false" subdivisions that, although they reduce the total
error, do not correspond to the true relationship between the predictors and the
response variable. Classical methods, which evaluate overfitting through error terms,
do not seem appropriate for our aim. It will be shown how the STP procedure, by
studying the dependence between the response variable and the split variables, can
evaluate the presence of overfitting in both simulations and real examples,
preserving only significant subdivisions.



1 Introduction 

Tree based classification and regression methodologies have proven 
to be a powerful nonparametric approach to data analysis leading 
to several methodological proposals and practical applications. 
These methods can be summarized as follows: let Y be a response 
variable, which can take values either in the real line or in a set 
of previously defined classes, and let $X_1, X_2, \ldots, X_p$ be a set of
covariates. The problem is to establish a relationship between Y 
and the covariates in the form of a binary tree. 

A standard tree construction arises from a divide and conquer 
algorithm which recursively divides the data space into two sub- 
regions (subgroups of observations) according to a splitting cri- 
terion. At each step (tree node) the splitting criterion selects the 
best covariate and cut point associated with it. The goodness of 
split at a given node is measured in terms of achievable separabil- 
ity among the sibling nodes, i.e., data partitioning is led by the 
aim of making the distribution of the response variable as pure as 
possible within the nodes. 







Because there is noise in the data, the structure resulting from the 
divide and conquer algorithm tends to be large and above all it 
overfits the data. As a consequence the tree has so many splits that 
its structure is difficult to interpret. Furthermore several splits, 
especially the terminal ones, are due to chance in the sense that 
they reflect particular features of the sample data rather than 
modeling real underlying relationships between the response and 
the covariates. 

The problem of overfitting in tree based methods has been widely
discussed in the literature (see for example Quinlan, 1986). The
standard approach to avoiding overfitting (and, besides, complexity) is
tree simplification, which retrospectively prunes away some of the
branches of the so-called totally expanded tree.

Pruning methods proposed in the literature (for a review see Mingers,
1989) are based on the evaluation of the classification/prediction
accuracy.

By discarding branches on the basis of their error, pruning
improves the accuracy of the tree and its understandability, but
overfitted branches are removed purely by chance.

In other words, since overfitting can be understood as "reporting false
signals", classical pruning cannot be seen as a tool to avoid it.
In fact, it does not make use of any specific tool to verify whether
signals are false.

Also, as a means to improve accuracy, pruning has been
outperformed by new strategies (Breiman, 1996, 1998; Freund and
Schapire, 1998) based on resampling procedures. These have proven
to produce more accurate classifiers/predictors, but come at the
expense of losing the structure of the tree, i.e., the knowledge of
the splits.

When knowledge acquisition (Hand, 1998) is a concern, these
strategies cannot be employed and, since classical pruning also
fails, a different approach to simplification is needed.

In the framework of the pioneering pruning method provided in the
CART methodology (Breiman et al, 1984), a statistical test pruning
procedure (STP) has been recently introduced by Cappelli et
al (2002) both for classification and regression. The test looks
for the final tree structure which retains only those splits that
provide a significant contribution to explaining the relationships
between the response and the covariates.

This paper focuses on the case of classification and it reports the
results of examples and applications used to evaluate the
performance of STP as an overfitting avoidance tool relative to the
classical CART pruning method.

The paper is organised as follows: section 2 briefly describes the 
CART approach and the STP procedure for classification trees. 
Section 3 is devoted to some experiments. 

2 Simplifying Trees 

CART pruning method. The CART approach is based on the concept
of a cost-complexity measure, which is a way to concentrate
simultaneously on both the critical aspects of size and accuracy. For
any node $t$ and for the subtree $T_t$ rooted at $t$ the cost-complexity
measure is defined as:

$R_\alpha(t) = R(t) + \alpha$, (1)

$R_\alpha(T_t) = R(T_t) + \alpha |T_t|$, (2)

where $R(t)$ and $R(T_t)$ are the resubstitution error measures at
node $t$ and at subtree $T_t$ respectively, and $|T_t|$ is the number of
leaves of the subtree.

The two measures are equal if:

$\alpha(t) = \frac{R(t) - R(T_t)}{|T_t| - 1}$ (3)

From (3) we see that $\alpha(t)$ gives the gain in accuracy per terminal
node in the given branch. Therefore, when the two measures
coincide, there is no point in retaining the subtree, which increases
the size of the tree without improving its accuracy.
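
As a small numerical illustration of equations (1)-(3), the following sketch computes the threshold $\alpha(t)$ for one branch and checks that the two cost-complexity measures coincide at that value. The error values and the number of leaves are hypothetical, chosen only for the example, and the function names are not taken from the CART software.

```python
def alpha(node_error, subtree_error, n_leaves):
    """Gain in accuracy per terminal node, as in equation (3)."""
    if n_leaves < 2:
        raise ValueError("a branch needs at least two leaves")
    return (node_error - subtree_error) / (n_leaves - 1)


def cost_node(node_error, a):
    """Cost-complexity of the node t taken as a leaf, equation (1)."""
    return node_error + a


def cost_subtree(subtree_error, n_leaves, a):
    """Cost-complexity of the subtree T_t rooted at t, equation (2)."""
    return subtree_error + a * n_leaves


# Hypothetical branch: R(t) = 0.20, R(T_t) = 0.08, |T_t| = 4 leaves.
a = alpha(0.20, 0.08, 4)
same_cost = abs(cost_node(0.20, a) - cost_subtree(0.08, 4, a)) < 1e-12
```

For any complexity parameter above this threshold the single node has the lower cost, which is why the branch is pruned at that point.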

Although the complexity parameter takes a continuous range of
values, the number of subtrees of the maximal tree is finite, thus
the pruning process produces a finite sequence of subtrees. In
other words, if $T_\alpha$ is the optimal tree that minimizes the
cost-complexity measure for a given value of $\alpha$, it continues to be
optimal as $\alpha$ increases until a threshold $\alpha'$ is reached and a new
tree $T_{\alpha'}$ (with fewer terminal nodes) becomes the optimal one.
This means that a tree is optimal for an interval of values of the
complexity parameter and the sequence is generated by finding at
each step the upper limit of the interval, i.e., the threshold given
by (3).

The algorithm consists of two stages: first the sequence of pruned
subtrees is generated by pruning the node that contributes the least
gain in accuracy, then a single tree is selected. This is the one
that minimizes the error rate estimated by dropping test
observations down the tree. This criterion is referred to as the 0-SE rule.
Since the minimum is highly unstable, the choice of the tree with
the minimum error rate might be arbitrary. An alternative
criterion, therefore, is the 1-SE rule, which selects the smallest tree
whose error rate on the test set is within one standard error of
the minimum.
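
The two selection rules can be sketched as follows. This is only an illustration under stated assumptions: each tree of the pruned sequence is summarised by a hypothetical triple (number of leaves, test error rate, standard error of that error rate), and the numbers below are invented for the example.

```python
def select_tree(sequence, n_se=0.0):
    """Pick a tree from a pruned sequence.

    sequence: list of (n_leaves, test_error, std_error) triples.
    n_se = 0 gives the 0-SE rule, n_se = 1 gives the 1-SE rule.
    """
    best = min(sequence, key=lambda tree: tree[1])
    threshold = best[1] + n_se * best[2]
    # Keep the trees whose error does not exceed the threshold,
    # then return the smallest of them.
    candidates = [tree for tree in sequence if tree[1] <= threshold]
    return min(candidates, key=lambda tree: tree[0])


# Hypothetical pruned sequence, from the largest tree to the root.
sequence = [
    (12, 0.110, 0.015),
    (8, 0.100, 0.015),
    (5, 0.108, 0.016),
    (3, 0.113, 0.016),
    (1, 0.300, 0.020),
]
zero_se_tree = select_tree(sequence, n_se=0.0)
one_se_tree = select_tree(sequence, n_se=1.0)
```

In this invented sequence the 0-SE rule keeps the 8-leaf tree, while the 1-SE rule accepts any tree with error up to 0.115 and therefore returns the 3-leaf tree.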

The idea of generating a sequence of trees based on size and
accuracy is undoubtedly appealing. The selection of the optimum pruned
subtree would require searching the space of all possible subtrees of
the given totally expanded tree. This space is typically too
vast; the cost-complexity criterion represents a way to restrict the
search to trees which are optimal.

On the other hand, the selection criterion, which takes into account
solely the aspect of accuracy, appears insufficient for the aim of
detecting the "model" that underlies the data and explains the true
relation between the response and the covariates.

In fact, the 0-SE rule tends to select large trees retaining spurious
splits that accidentally reduce the error without being really
related to the response, whereas the 1-SE rule in most cases
results in the selection of a very small tree, showing a bias toward
overpruning.



The STP procedure (Cappelli et al., 2001). Under the condition
that the same measure, say $I$, is employed both to grow and to prune
the tree, the complexity parameter can be proven to be:

$\alpha(t) = \frac{1}{|L_t|} \sum_{l \in L_t} \Delta I(s^*, l)$ (4)

i.e., the mean of the reduction in the measure $I$ induced by the
best split $s^*$ of each internal node $l$ of the branch $T_t$, in the set $L_t$
whose cardinality is $|L_t| = |T_t| - 1$.

This result is the starting point for the definition of a statistical
testing procedure as the third stage of the tree growing approach,
aimed at validating the pruning process, i.e., at finding the reliable
honest size tree.

By replacing the error rate with the Gini index of heterogeneity
in the pruning process a different complexity parameter can be
defined:

$\beta(t) = \frac{1}{|L_t|} \sum_{l \in L_t} \Delta i(s^*, l)\, p(l)$ (5)

where $p(l) = n(l)/n$ is the weight of node $l$; this complexity
parameter is proven to be related to the $\chi^2$ distribution as follows:

Jnfiit) ~ x\j-iy (6) 

The proposed procedure generates the sequence of pruned subtrees
according to the CART method but using the Gini index
instead of the error rate. Since for any internal node the
complexity parameter represents the gain in accuracy induced by the
corresponding subtree, the statistic (6) can be used to test the
significance of this gain, i.e., to verify whether the corresponding
branch should be kept or pruned.

The testing considers an independent test set on whose
observations the complexity parameters are re-computed. The values
of the complexity parameter increase at each step on the training set,
whereas on the test set they do not form an increasing sequence.
Therefore the result is a single final tree that, typically, does not
correspond to any tree in the CART sequence.







3 Examples and Applications 

The simplest way to show how the STP procedure copes with
overfitting is to study how it performs, in terms of the size of the
final tree and of the splitting variables it includes, when applied to
data drawn from a known model.

To this aim we consider a very simple base model designed to
provide insights into the problem of simplifying the tree. The model
is as follows:



$y_i = 0.01 x_{1i} + 0.05 x_{2i} + 5 x_{3i} + \varepsilon_i$ (7)

where $x_{1i}$, $x_{2i}$, $x_{3i}$ are generated from the uniform distribution in
the intervals [-1, 0], [0, 1] and [-0.5, 0.5] respectively and $\varepsilon_i$ is drawn
from a zero-mean homoscedastic normal:

$\varepsilon_i \sim N(0, \sigma_\varepsilon)$, $i = 1, \ldots, n$.

A dummy response variable $y^*$ is created according to the following
rule:

$y^*_i = 1$ if $y_i < 0$,

$y^*_i = 2$ if $y_i > 0$.

Given the way the model is constructed, covariates $x_1$ and $x_2$ play
no real role in determining the response, which depends exclusively
on covariate $x_3$.

If no error term is added to the model, a classification tree would
be characterized by splits involving only this covariate. By adding
the error term, the relation becomes less and less evident depending
on $\sigma_\varepsilon$.

We consider n = 1000 observations, randomly generated according
to model (7), where the error term standard deviation has been
set to 0.1, 0.3, 0.7, 2 and 3 respectively, resulting in 5 data sets.
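
The data generation just described can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the function name, the seeds and the use of Python's standard `random` module are assumptions, and observations with y exactly equal to 0 (an event of probability zero) are assigned to class 2.

```python
import random


def simulate(n, sigma, seed=0):
    """Draw n observations (x1, x2, x3, class) from model (7)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 0.0)
        x2 = rng.uniform(0.0, 1.0)
        x3 = rng.uniform(-0.5, 0.5)
        # sigma is the standard deviation of the error term.
        eps = rng.gauss(0.0, sigma) if sigma > 0 else 0.0
        y = 0.01 * x1 + 0.05 * x2 + 5.0 * x3 + eps
        label = 1 if y < 0 else 2
        data.append((x1, x2, x3, label))
    return data


# One data set of 1000 observations per error standard deviation.
sigmas = [0.1, 0.3, 0.7, 2, 3]
datasets = {s: simulate(1000, s, seed=i) for i, s in enumerate(sigmas)}
noiseless = simulate(1000, 0.0, seed=99)
```

In the noiseless data the class is determined by the sign of y, which is dominated by the term in x3; this is why a single split on that covariate is enough.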

Figure 1 shows the boxplots of covariate $x_3$ for the noiseless case
($\varepsilon = 0$) and the extreme hypothesis on the error term standard
deviation ($\sigma_\varepsilon = 3$) against the two response classes.

The higher the error term standard deviation, the weaker
the relation between the response and covariate $x_3$; going from
the noiseless data to the extreme hypothesis ($\sigma_\varepsilon = 3$) we see that
the boxes in the plots almost overlap. In the application
the data sets are randomly split into a training set (70%) and a
test set (30%). For each data set the classification tree is built using
the Gini index as impurity measure to evaluate the goodness of
split, i.e., to grow the tree. The resulting totally expanded tree is
pruned according to both the classical CART and the STP procedure.
The former creates the sequence of increasingly pruned subtrees
aiming at minimizing the error rate. The latter uses the Gini index
in creating the sequence and then validates it by statistical testing.
A tree is grown also for the noiseless data; a single split along
covariate $x_3$ fits the training set perfectly.

The results are summarized in Table 1, which, for each case,
reports the number of terminal nodes and the splitting variables
of the totally expanded tree, the final CART tree and the STP
tree, respectively. In the STP procedure a significance level of 0.01
is considered.
Fig. 1. The boxplots of covariate $x_3$ for $\varepsilon = 0$ and $\sigma_\varepsilon = 3$



As a general observation, trees are highly affected by noise. For
$\sigma_\varepsilon = 0.1$ the procedure grows a tree with 7 terminal nodes, whereas
in the noiseless case a single split on covariate $x_3$ is able to
separate the two classes. As a matter of fact, one-depth trees, i.e.,
trees able to provide a satisfactory fit to the data with a single
split, are not uncommon in practical applications, as shown by
Holte (1993).
Increasing the error results in larger and larger maximal trees.
This behavior is the consequence of the divide and conquer strategy
which, in the effort to model all the data, i.e., to make nodes
purer and purer, ends up involving splits on covariates having no
explanatory power with respect to the response, namely $x_1$ and $x_2$. CART
pruning strongly reduces the size of the totally expanded tree but,
since it is not meant to address overfitting, it is affected by it,
resulting anyway in overly large trees which include splits on the
two "weak" covariates.

Table 1. Size and splitting variables included in the maximal trees, the CART
trees (0-SE rule) and the STP trees.

                             Maximal trees     CART trees        STP trees
Error term                   Size  Variables   Size  Variables   Size  Variables
N(0, 0.1)                       7  x1,x2,x3       4  x1,x3          2  x3
N(0, 0.3)                      25  x1,x2,x3      13  x1,x2,x3       3  x3
N(0, 0.7)                      64  x1,x2,x3      19  x1,x2,x3       3  x3
N(0, 2)                        95  x1,x2,x3      16  x1,x2,x3       2  x3
N(0, 3)                       125  x1,x2,x3      23  x1,x2,x3       3  x3
N(0, 0.01 x2 + 0.05 |x1|)      12  x1,x2,x3       8  x1,x2,x3       2  x3

On the contrary STP, being based on the statistical evaluation of
the dependence of the response on the splitting variable, is able
to detect meaningless splits and to avoid them. The size of the
final trees is the same as or slightly higher than that of the noiseless
tree, retaining only the higher level splits on covariate $x_3$; even in
the most extreme case, when the noise in the data creates almost
a situation of inseparability between the two response classes, the
test is able to detect the relation between the response and
covariate $x_3$. Note that a further increase in the error term standard
deviation causes the tree methodology to fail, and the first split
is along covariate $x_1$.

An example with heteroscedastic error is also considered.
Although the maximal tree shows a modest size (12 terminal nodes),
it includes splits on covariates $x_1$ and $x_2$ in the upper levels of
the tree; the final tree selected by means of the 0-SE rule, with
8 terminal nodes, is comparatively large. According to the STP
procedure only the first split (covariate $x_3$) turned out to be
significant.
Applications to real data sets are considered as well. The data
are: the Wisconsin diagnostic breast cancer data (wdbc), from
the UCI machine learning repository
(http://www.ics.uci.edu/~mlearn/MLRepository.html), and the credit data
and the Kyphosis data from the libraries of the software SPAD and
SPLUS respectively. The wdbc data consist of 569 cellular samples, on which a
binary response variable (malignant or benign) and 30 real-valued
features of the cell nuclei were recorded. The credit data
consist of 468 clients of a bank; the response is binary (reliable
clients, i.e., those who paid their debts, and unreliable ones) and there are
11 nominal covariates. The kyphosis data have 81 instances
representing children who have had corrective spinal surgery. The
outcome is a binary variable (whether a postoperative deformity
is "present" or "absent"), to be predicted based on three numeric
variables.

Table 2 reports the results of the comparison. In this case, since
the true underlying tree model is unknown, the AIC and the BIC model
selection criteria are considered to evaluate the outcome of the CART
selection rules and of the STP procedure. Trees can
be represented as contingency tables that cross-classify the
terminal nodes and the response classes. For these tables, AIC =
-2(maximised log-likelihood - number of parameters) ranks the models in a
way that is equivalent to using $G^2 - 2(df)$, where $G^2$ is the
log-likelihood ratio test statistic; the criterion BIC = $G^2 - \log(n)(df)$
takes sample size into account.
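
The two criteria can be computed from a contingency table of terminal nodes by response classes, as in the sketch below. The text does not state which reference model $G^2$ is measured against; the code assumes, purely for illustration, the likelihood-ratio statistic for independence between rows and columns, with df = (rows - 1)(columns - 1). The function names and the example counts are hypothetical.

```python
import math


def g2_independence(table):
    """Likelihood-ratio statistic G^2 for independence in a two-way table.

    table: list of rows (terminal nodes), each a list of class counts.
    Returns (G^2, degrees of freedom, sample size).
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df, n


def criteria(g2, df, n):
    """Return (G^2 - 2 df, G^2 - log(n) df), the forms quoted in the text."""
    return g2 - 2.0 * df, g2 - math.log(n) * df


# Hypothetical tree with 3 terminal nodes and 2 response classes.
leaves_by_class = [[40, 5], [10, 30], [3, 12]]
g2, df, n = g2_independence(leaves_by_class)
aic_form, bic_form = criteria(g2, df, n)
```

Since log(n) exceeds 2 for any sample with more than 7 observations, the BIC form penalises each degree of freedom more heavily than the AIC form.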

Table 2. Size and values of the AIC and BIC criteria for the CART trees and
the STP trees.

          Wdbc                      Credit                    Kyphosis
Methods   Size  AIC      BIC       Size  AIC      BIC       Size  AIC     BIC
0-SE        12  444.804  401.463     33  240.904  119.155     10  28.900  10.513
1-SE         4  392.053  380.232     10  157.440  123.303      7  27.730  15.472
STP          4  388.738  376.917      8  125.913  103.155      2  10.450   8.407


The STP procedure outperforms the CART selection rules, achieving
considerably smaller values of the criteria, indicating better
fits to the data. These results, therefore, confirm that the
selection rules proposed in CART are somewhat arbitrary from
the point of view of the fit provided by the tree model to the
data. Note that the STP selects smaller trees than CART for all the
examples except the wdbc data. This case is particularly
interesting since the 1-SE tree and the STP tree have the
same size but differ in a single split. Nevertheless, this slight
difference in the structure of the tree causes a difference in the fit
provided to the data, showing that minimizing the error can be a
misleading strategy.

4 Concluding Remarks 

Statistical tests for pruning trees have been proposed as a top down
stopping rule to stop growing a one-depth branch (Mola and
Siciliano, 1994; Lanubile and Malerba, 1998) or to prune retrospectively
a one-depth branch (Mingers, 1989). The STP differs significantly
from these since it takes advantage of the intuitive and appealing
strategy proposed in the CART methodology, combining it with
a desirable statistical validation. It searches in the space of the
CART-like pruned trees, investigating the dependence of
the response on the splitting variables. The experiments show
that the STP is a promising approach to detecting false relations
reported in the growing phase, i.e., to avoiding overfitting. This is
particularly needed when the knowledge of the explanatory splits
is a concern.
Abstract. This paper aims to present a new algorithm to classify symbolic
data. The input data for the learning step is a set of symbolic objects described 
by symbolic interval (or set-valued) variables. At the end of the learning step, each 
group is represented by a (modal) symbolic object which is described by symbolic 
histogram (or bar-diagram) variables. The assignment of a new observation to a 
group is based on a dissimilarity function which measures the difference in content 
and in position between them. The difference in position is measured by a context 
free component whereas the difference in content is measured by a context depen- 
dent component. To show the usefulness of this modal symbolic pattern classifier, 
a particular kind of simulated images is classified according to this approach. 



1 Introduction 

In many domains, huge sets of data are recorded in large databases.
A basic step of Symbolic Data Analysis (Diday (2000))
consists in summarizing these data and extracting new knowledge from
them by means of symbolic data. Symbolic data are more complex
than standard data as they contain internal variation and they
are structured. The main goal of Symbolic Data Analysis is
to apply new tools to this extracted knowledge in order to extend
Data Mining to Knowledge Mining (Diday (2000)). Therefore, an
extension of data analysis and statistical methods to symbolic
data is necessary.

Ichino et al. (1996) introduced a symbolic classifier as a re- 
gion oriented approach based on Boolean symbolic data. In the 
learning step, each group is described by a disjunction of Boolean 
symbolic objects, which are obtained by using as tools of gener- 
alisation an approximation of the Mutual Neighbourhood Graph 
(MNG) and a Boolean symbolic operator (join). At the end of 
this step, the classifier furnishes a complete and discriminant
description of the examples of each group. In the classification step,
the allocation of an observation to a group is based on a matching
function between a usual description and a Boolean symbolic
description, which describes a group.

Souza et al. (1999) and De Carvalho et al. (2000) have pro- 
posed a new approach in order to get a MNG approximation and 
to reduce the complexity of the learning step, without loss of 
the classifier performance (measured by the error rate of classifi- 
cation). This new approach has made it possible to study a special
kind of simulated images (SAR images). At the classification step,
new context free similarity and dissimilarity measures have been 
introduced in order to compare a standard description of an indi- 
vidual with a Boolean symbolic description of a group. 

This paper aims to present an algorithm for building a symbolic
classifier where each group is represented by a modal symbolic
description. In such a framework, a compromise dissimilarity
function is proposed. It is based on two components: a location
component and a content component. The first component is supposed
to be independent of the context while the second one is context dependent.

2 Symbolic Object Descriptions 

In classical data analysis, an individual is described by a row of a 
data matrix (the columns are the variables), which assumes only 
one value from its domain. According to the kind of domain, a 
variable may be quantitative (discrete or continuous) or qualita- 
tive (ordinal or nominal). However, in the real world the recorded
information is often too complex to be described by usual data.
This is why different kinds of symbolic variables and symbolic data
have been introduced.

A symbolic variable is set-valued if, for an object (individual
or class), it takes a subset of categories from its domain. In
particular, an interval variable takes, for an object, a continuous set
of values in a subset of $\mathbb{R}$, whereas a modal variable takes, for
an object, a non-negative measure (a frequency or a probability
distribution or a system of weights). If this measure is specified
in terms of a histogram, the modal variable is called a histogram
variable, whereas if the measure is specified by a bar diagram the
modal variable is called a bar diagram variable (Bock (2000)).

In the context of Symbolic Data Analysis, a symbolic
description of an object is a symbolic data vector whose columns are
symbolic variables. These symbolic descriptions are collected in a
Symbolic Data Table, which is more complex than the usual one
because each cell contains more than a single value. Weights
can be associated with such intervals, and logical rules and
taxonomies can also be considered on the symbolic descriptors
(Diday (2000)).

Example 1. A segment (set of pixels) may have as symbolic
description the symbolic data vector $d_s = ((0.7\,[70, 80[), (0.3\,[90, 120[))$,
where the characteristic grey level, which takes values in [70, 80[
and [90, 120[, is modelled as a histogram (or modal) variable, and
0.7 and 0.3 are the relative frequencies of the two intervals of
values.

3 A Modal Symbolic Classifier 

In this section, we are going to describe one of the versions of 
the symbolic classifier based on modal descriptions that we have 
implemented. 

3.1 Learning Step 

Two steps constitute the learning process: pre-processing and gen- 
eralization. 

Pre-processing. The aim of this step is to represent each seg- 
ment as a modal symbolic description. The segments, obtained 
from a segmentation algorithm, are the input of the learning step. 
Each segment is constituted by a set of pixels, on which different 
levels of grey are measured. For each segment $s$ we consider an
interval $I_s$ representing the range of the grey level of the pixels.

Example 2. Table 1 shows examples of some segments from two
distinct regions of an image. In this table, a segment is described
by an interval variable which represents the range of the
grey levels of the set of pixels belonging to it.



Table 1. Segments described by a symbolic interval variable

Segments (set of pixels)   Grey level (y_1)   Region
Seg_1                      [10, 30]           1
Seg_2                      [25, 35]           1
Seg_3                      [90, 130]          2
Seg_4                      [125, 140]         2


From the intervals describing the segments, we create a new
set of intervals $I_T = \{I_j\}_{j \in J}$, such that $I_j \cap I_{j'} = \emptyset$, $\forall j, j' \in J$,
with $j \neq j'$, as follows: first, we take the set of values formed
by every bound (lower and upper) of all the intervals associated
with the initial segments. Then, this set of bounds is sorted in
increasing order. Every interval of $I_T$ is defined by two successive
limits of this ordered set.

Example 3. From Table 1, the new set of intervals is $I_T =
\{I_1, I_2, I_3, I_4, I_5, I_6, I_7\}$ where $I_1 = [10, 25[$, $I_2 = [25, 30[$, $I_3 =
[30, 35[$, $I_4 = [35, 90[$, $I_5 = [90, 125[$, $I_6 = [125, 130[$ and $I_7 =
[130, 140]$.
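
The construction of the elementary intervals can be sketched as follows. This is a minimal illustration, not the authors' implementation: intervals are represented as (lower, upper) pairs and the open or closed nature of the bounds is ignored.

```python
def elementary_intervals(segments):
    """Build the disjoint intervals defined by successive sorted bounds.

    segments: list of (lower, upper) grey-level ranges, one per segment.
    """
    bounds = sorted({b for lower, upper in segments for b in (lower, upper)})
    return list(zip(bounds[:-1], bounds[1:]))


# The four segments of Table 1.
segments = [(10, 30), (25, 35), (90, 130), (125, 140)]
cells = elementary_intervals(segments)
```

With the segments of Table 1 this yields the seven intervals listed in Example 3.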

In this way, if $I_s$ is the interval which describes the segment $s$,
then every $I_j \in I_T$ satisfies one of the following conditions:
$I_j \cap I_s = \emptyset$ or $I_j \subseteq I_s$. Let $J_s = \{j \in J \mid I_j \subseteq I_s\}$. A segment
$s$ can be represented by means of a modal symbolic description,
determined as follows:

$d_s = ((w_{ks} I_k)_{k \in J_s})$, $w_{ks} = \frac{|I_k|}{\sum_{k' \in J_s} |I_{k'}|}$ (1)

where $|I_k|$ is the range of the interval $I_k$.

Example 4. Table 2 shows the segments of Table 1 described
by a symbolic histogram variable.

Table 2. Segments described by a symbolic histogram variable

Segments (set of pixels)   Grey level (y_1)                           Region
Seg_1                      ((0.75 [10, 25[), (0.25 [25, 30[))         1
Seg_2                      ((0.50 [25, 30[), (0.50 [30, 35[))         1
Seg_3                      ((0.88 [90, 125[), (0.12 [125, 130[))      2
Seg_4                      ((0.33 [125, 130[), (0.67 [130, 140]))     2
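
The weights of Table 2 follow from equation (1): each elementary interval contained in the segment receives a weight proportional to its range. The sketch below reproduces them; the function name and the dictionary representation are illustrative choices, and the elementary intervals are typed in directly from Example 3.

```python
def modal_description(segment, cells):
    """Weights of equation (1) for one segment.

    segment: (lower, upper) bounds of the interval describing the segment.
    cells:   elementary intervals as (lower, upper) pairs.
    Returns a dict mapping each contained cell to its weight.
    """
    lower, upper = segment
    inside = [(a, b) for a, b in cells if lower <= a and b <= upper]
    total = sum(b - a for a, b in inside)
    return {(a, b): (b - a) / total for a, b in inside}


# Elementary intervals of Example 3.
cells = [(10, 25), (25, 30), (30, 35), (35, 90),
         (90, 125), (125, 130), (130, 140)]

seg1 = modal_description((10, 30), cells)
seg2 = modal_description((25, 35), cells)
seg3 = modal_description((90, 130), cells)
seg4 = modal_description((125, 140), cells)
```

For the third segment the exact weights are 0.875 and 0.125, which Table 2 reports rounded to two decimals.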



Generalization. This step aims to represent each Region as a
modal symbolic object. The symbolic description of each region is
a generalization of the modal symbolic descriptions of its segments.

Let $S = \{s_n\}_{n \in N}$ be a set of segments belonging to the same
Region $r$, and $D = \{d_n\}_{n \in N}$ their modal symbolic descriptions.
Let $d_n = ((w_{jn} I_j)_{j \in J_{s_n}})$ be the modal symbolic description of
$s_n$. The Region $r$ is similarly represented by a modal symbolic
description as follows:

$d_r = ((w_{kr} I_k)_{k \in \bigcup_{n \in N} J_{s_n}})$, with $w_{kr} = \frac{1}{\#N} \sum_{n \in N} w_{kn}$ (2)

where $\#N$ is the number of segments belonging to the group $r$ and
$w_{kn}$ is taken as 0 when $k \notin J_{s_n}$.
For every segment $s_n$ there is an associated interval $I_{s_n}$. We also
associate with the Region $r$ an interval $I_r$ which is defined as: $I_r =
\bigcup_{m \in M} I_m$, where $M = \{j \in J \mid I_j \subseteq I_{s_n}$ for at least one $n \in N\}$.

Example 5. Table 3 shows the modal symbolic descriptions of
the Regions:

Table 3. Regions described as a modal symbolic description

Region   Grey level (y_1)
1        ((0.38 [10, 25[), (0.37 [25, 30[), (0.25 [30, 35[))
2        ((0.44 [90, 125[), (0.23 [125, 130[), (0.33 [130, 140]))
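
The region weights of Table 3 can be reproduced by averaging the segment weights of Table 2 interval by interval, as in equation (2). The sketch below assumes that an interval absent from a segment description contributes a weight of 0, which is the reading consistent with the values in Table 3; the function name and data layout are illustrative.

```python
def generalize(descriptions):
    """Average the segment weights cell by cell, as in equation (2).

    descriptions: list of dicts mapping (lower, upper) cells to weights.
    A cell missing from a description counts with weight 0.
    """
    n_segments = len(descriptions)
    cells = sorted({cell for d in descriptions for cell in d})
    return {
        cell: sum(d.get(cell, 0.0) for d in descriptions) / n_segments
        for cell in cells
    }


# Exact segment weights behind Table 2.
seg1 = {(10, 25): 0.75, (25, 30): 0.25}
seg2 = {(25, 30): 0.50, (30, 35): 0.50}
seg3 = {(90, 125): 0.875, (125, 130): 0.125}
seg4 = {(125, 130): 1 / 3, (130, 140): 2 / 3}

region1 = generalize([seg1, seg2])
region2 = generalize([seg3, seg4])
```

The exact weights for Region 1 are 0.375, 0.375 and 0.25; Table 3 shows them rounded so that they still sum to 1.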



3.2 Allocation step 

The assignment of a new observation (segment) to a group (region)
is based on a dissimilarity function, which compares the symbolic
description of the new segment and the symbolic description of
the region. The dissimilarity function, which will be described
later on, measures the difference in content and in position
between the segment and the group. This dissimilarity is a hybrid
function: the difference in position is measured by a context free
component, whereas the difference in content is measured by a
context dependent component.

Two steps also constitute the allocation process: pre-processing 
and assignment. 

Pre-processing. The aim of this step is to represent the new 
segment by a modal symbolic description. In this paper, this is 
achieved by associating to its interval a weight equal to 1. 

Assignment. Let $x$ be a new segment with description $d_x =
((w_{kx} I_k)_{k \in K})$ and $r$ be a Region with description $d_r = ((w_{jr} I_j)_{j \in J})$.
As we know, the interval $I_x$, associated with $x$, is such that $I_x =
\bigcup_{k \in K} I_k$ and $I_k \cap I_{k'} = \emptyset$, $\forall k, k' \in K$ where $k \neq k'$. Let $L_x$ be the
set of lower bounds of the intervals $I_k$, $k \in K$, i.e., $L_x = \{L_k \mid I_k =
[L_k, U_k], k \in K\}$, and let $L_r = \{L_j \mid I_j = [L_j, U_j], j \in J\}$. Let $U_x$
be the set of upper bounds of the intervals $I_k$, $k \in K$, i.e., $U_x =
\{U_k \mid I_k = [L_k, U_k], k \in K\}$, and let $U_r = \{U_j \mid I_j = [L_j, U_j], j \in J\}$.

The context free component of the dissimilarity function is 
defined by: 



$\phi_{cf}(x, r) = \frac{|\overline{I_x} \cap \overline{I_r} \cap (I_x \oplus I_r)|}{|I_x \oplus I_r|}$   (3)



where $I_x \oplus I_r$ is the join-interval of the two intervals $I_x$ and $I_r$
(Ichino et al. (1996)), which is defined as

$I_x \oplus I_r = [\min\{\min(L_x), \min(L_r)\}, \max\{\max(U_x), \max(U_r)\}]$   (4)

The functions min and max furnish, respectively, the lowest 
and the greatest value among a set of values. 
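
As a small illustration of the join-interval in equation (4), the sketch below takes the two descriptions as lists of `(lower, upper)` tuples. This encoding and the name `join_interval` are assumptions made for the example only.

```python
def join_interval(intervals_x, intervals_r):
    """Join-interval of two interval sets, as in equation (4).

    intervals_x, intervals_r: lists of (lower, upper) tuples
    (hypothetical encoding of the intervals I_k and I_j).
    """
    lowers = [low for low, _ in intervals_x] + [low for low, _ in intervals_r]
    uppers = [up for _, up in intervals_x] + [up for _, up in intervals_r]
    # Smallest lower bound and greatest upper bound over both sets.
    return (min(lowers), max(uppers))


segment = [(10, 25), (25, 30), (30, 35)]
region = [(90, 125), (125, 130), (130, 140)]
print(join_interval(segment, region))
```

With the grey-level intervals of the two Regions in Table 3, the join-interval spans from 10 to 140.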

The context dependent component of the dissimilarity function 
is defined by: 

$\phi_{cd}(x, r) = \frac{1}{2} \left( \sum_{k \in K} w_{kx} \frac{|I_k \cap \overline{I_r}|}{|I_k|} + \sum_{j \in J} w_{jr} \frac{|I_j \cap \overline{I_x}|}{|I_j|} \right)$   (5)



The global dissimilarity function is given by the sum of the 
two components: 



$\phi(x, r) = \phi_{cf}(x, r) + \phi_{cd}(x, r)$   (6)



4 The Monte Carlo Experiments

Synthetic Aperture Radar (SAR) is a system that carries its own
illumination and produces images with a high capacity to discriminate
objects. It uses coherent radiation, which generates images with
speckle noise. SAR data exhibit a random behaviour that has
usually been explained by a multiplicative model (Frery et al.
(1997)). This model considers that the observed return signal Z is
a random variable defined as the product of two random variables:
X (the terrain backscatter) and Y (the speckle noise).

A special kind of simulated SAR image will be classified according
to the approach described in the previous section (Souza
et al. (1999), De Carvalho et al. (2000)). The process to obtain
simulated images consists in creating classes of idealized images
(a phantom, see Figure 1), and then in associating to every class
a particular distribution.

Different kinds of detection (intensity or amplitude format) 
and types of regions can be modelled by different distributions as- 
sociated to the return signal. Concerning the region types, the ho- 
mogeneous (e.g. agricultural fields), heterogeneous (e.g. primary 
forest) and extremely heterogeneous (e.g. urban areas) are con- 
sidered in this work. According to Frery et al. (1997) we assume 
that the return signal in amplitude case has the square root of a 
Gamma distribution, the K-Amplitude distribution and the G0- 
Amplitude distribution in homogeneous, heterogeneous and ex- 
tremely heterogeneous areas, respectively. 

Fig. 1. Phantom with five regions



We generate the distributions associated to each class in each 
situation by using an algorithm for generating gamma variables 
(see Figure 2). 





Fig. 2. Densities of SAR images 



The parameters of these distributions are specified in Tables
4 and 5. Moreover, the Lee filter (Lee (1981)) was applied to the
data before segmentation in order to decrease the effect of the
speckle noise (see Figure 3).

The segmentation was obtained using the region growing technique
(Jain (1989)), based on Student's t-test (at the 5% significance
level) for the merging of regions. The Monte Carlo experiment
was performed for images of size 512 x 512 and considering

Table 4. Distributions and parameters for situation 1

Regions  Distributions          α     β       λ         γ
1        Γ^{1/2}         42     -     1916    -         -
2        K-A             84     2     -       0.00023   -
3        K-A             168    IT    -       0.00025   -
4        Γ^{1/2}         126    -     17249   -         -
5        G0-A            210    -5    -       -         203987



Table 5. Distributions and parameters for situation 2

Regions  Distributions          α     β       λ         γ
1        Γ^{1/2}         130    -     18361   -         -
2        K-A             150    5     -       0.00019   -
3        K-A             110    4     -       0.00028   -
4        Γ^{1/2}         160    -     27814   -         -
5        G0-A            190    -5    -       -         166982




Fig. 3. Filtered images (panels: Situation 1 and Situation 2)



two situations, ranging from moderate to great difficulty of clas- 
sification. 100 replications were obtained with identical statistical 
properties. 

The performance of the classifier was measured through the error
rate of classification, which was estimated by the Monte Carlo
method. The estimated error rate of classification corresponds to
the average of the error rates found for these replications. The
results were, on average, 2.05% and 17.27% for the two situations
considered, respectively. In Table 6, the confidence interval was
constructed using Student's t distribution.



Table 6. Results of the Monte Carlo experiments

                                                             Confidence interval (α = 5%)
SAR Images    Average error rate (%)   Standard deviation   Lower Bound   Upper Bound
Situation 1   2.05                     0.0306               1.98          2.11
Situation 2   17.27                    0.1173               17.08         17.50
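
A confidence interval of the kind reported in Table 6 can be sketched as follows. This is only an illustration under stated assumptions: the per-replication error rates are not given in the paper, so the input list is invented, and the default critical value 1.984 is the approximate 97.5% quantile of Student's t distribution with 99 degrees of freedom (100 replications). The function name is hypothetical.

```python
import math
import statistics


def mean_confidence_interval(error_rates, t_critical=1.984):
    """Mean, standard error and two-sided t interval for the mean.

    error_rates: error rates (%) observed in each replication.
    t_critical: Student's t quantile for the chosen level and n - 1
    degrees of freedom (default assumes 95% and 99 degrees of freedom).
    """
    n = len(error_rates)
    mean = statistics.fmean(error_rates)
    # Standard error of the mean from the sample standard deviation.
    std_error = statistics.stdev(error_rates) / math.sqrt(n)
    half_width = t_critical * std_error
    return mean, std_error, (mean - half_width, mean + half_width)


# Invented rates for four replications; 3.182 is the t quantile for 3 d.f.
rates = [2.0, 2.1, 1.9, 2.0]
print(mean_confidence_interval(rates, t_critical=3.182))
```

The critical value is passed in explicitly so that the sketch needs only the standard library; a statistics package could supply it for other sample sizes.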



5 Concluding Remarks 

We have presented a symbolic classifier based on modal symbolic
descriptions. The usefulness of this classifier is corroborated by the
good error rates obtained under situations ranging from moderate to
great difficulty of classification in a Monte Carlo experiment
on simulated images. This approach improves on those
presented by De Carvalho et al. (2000), Souza et al. (1999)
and Ichino et al. (1996) in terms of computational time complexity.
Other approaches for the construction of a symbolic classifier based on
modal symbolic descriptions are in progress.

Abstract. Several proximity measures have been proposed to compare classifications
derived from different clustering algorithms (Gordon (1999)). Few solutions
have been proposed for the comparison of two classification trees. We have considered
two of these in particular: one (Shannon and Banks (1999)) is a distance that
measures the amount of rearrangement needed to change one of the trees so that 
they result in an identical structure, while the other (Miglio (1996)) is a similarity 
measure that compares the partitions associated to the trees taking into account 
their predictive power. In this paper we analyze features and limitations of these 
proximity measures and suggest a normalizing factor for the distance defined by 
Shannon and Banks; furthermore we propose a new dissimilarity measure that con- 
siders both the aspects explored separately by the previous ones. 



1 Introduction 

Classification trees represent non parametric classifiers that ex- 
ploit the local relationship between the class variable and the 
predictors. They allow an automatic feature selection and a hier- 
archical representation of the measurement space. A typical seg- 
mentation procedure repeatedly splits the predictor space, gen- 
erally, in two disjoint regions according to a local optimization 
criterion; in order to control the effective complexity of the model 
and to avoid overfitting a pruning procedure follows the growing 
step (Breiman et al. (1984)). A binary tree is the representation 
of such a sequence of binary partitions. 

We can compare different classification trees in several situa- 
tions. For example, samples of trees can be obtained from indepen- 
dent data sets containing the same predictors and class variables 
as in multicentric studies (Shannon and Banks (1999)). Further- 
more, a multiplicity of these binary structures can derive from 
different greedy search algorithms used for identifying trees. Multiple
versions of the same classifier can also be obtained over bootstrap
datasets, and suitably combined in order to reduce the test
set error (Breiman (1996), Freund and Schapire (1996)).

A way to compare a set of trees can be based on the use of 
measures that evaluate the similarity or distance between trees. A 
proximity measure can be used, in these situations, not only when 
we need to combine the observed set of trees into a single inter- 
pretable one (Shannon and Banks (1999), Cappelli and Shannon 
(2000)), but also to represent geometrically the group of trees via 
multi-dimensional scaling and to cluster this set of trees. 

In this paper we analyze features and limitations of two prox- 
imity measures between classification trees: one is a distance that 
measures the amount of rearrangement needed to change one of 
the trees so that it has structure identical to the other (Shannon 
and Banks (1999)), while the other is a similarity measure that 
compares the partitions associated to the trees taking into account 
their relative predictive power (Miglio (1996)). We suggest a nor- 
malizing factor for the distance defined by Shannon and Banks; 
furthermore, we propose a new dissimilarity measure that con- 
siders both the aspects explored separately in the previous ones. 
The considered measures are then compared by analyzing some 
data sets in order to assess the capability of the new measure to 
detect differences in the structure and in the predictive power at 
the same time. 

2 Comparing Different Tree Structures 

The agreement of each tree $T_i$, $i = 1, \ldots, v$, with another tree
$T_j$, $j \neq i$, can be assessed on the subset $D_j$ used to grow $T_j$. On
each data set $D_j$, $j = 1, \ldots, v$, the corresponding tree $T_j$ offers a
description of the data which is as close as possible to this subset.
The agreement of $T_i$ with $T_j$ is then a measure of how well the
description of $D_i$ applies to $D_j$. As a result, the stability measured
in this way expresses the degree to which a description of a data
set $D_i$ can be generalized to the others. This proposal comes from
Ciampi and Thiffault (1988), who select the tree which best agrees
with the others.

Two trees can be compared on a dataset D by measuring simi- 
larity between the two classification rules that the trees represent. 
This problem has its own counterpart in the comparison of two 
partitions of the same set of objects in the context of cluster anal- 
ysis. 

First we consider a similarity measure between classification trees. 

Let $T_i$ and $T_j$ be two different trees, with $H$ and $K$ leaves
respectively, that can be used to classify a set of $n$ observations.
We may label from one to $H$ the leaves of $T_i$ and from one to $K$
the leaves of $T_j$ and form the matrix

$M = [m_{hk}]$, $h = 1, \ldots, H$, $k = 1, \ldots, K$,

where $m_{hk}$ is the number of objects which belong both to the $h$-th
leaf of $T_i$ and to the $k$-th leaf of $T_j$.

In analogy with a measure defined in Fowlkes and Mallows (1983)
for the comparison of two hierarchical clusterings, a numerical
measure of the degree of similarity between the two trees is then
defined to be

$B_{ij} = \frac{\sum_{h=1}^{H} \sum_{k=1}^{K} m_{hk}^2 - n}{\sqrt{P_i Q_j}}$   (1)

where

$m_{h0} = \sum_{k=1}^{K} m_{hk}$; $m_{0k} = \sum_{h=1}^{H} m_{hk}$; $P_i = \sum_{h=1}^{H} m_{h0}^2 - n$; $Q_j = \sum_{k=1}^{K} m_{0k}^2 - n$.

The $B_{ij}$ value is bounded between 0 and 1. $B_{ij}$ is equal to 1 when
$M$ has exactly $\min(H, K)$ elements that differ from 0, which means that
the leaves in each tree correspond completely. On the contrary,
$B_{ij}$ equals zero when each $m_{hk}$ is 0 or 1, so that every pair of
objects that appear in the same leaf in $T_i$ are assigned to different
leaves in $T_j$. $B_{ij}$ is also related to the sum of those pairs that have
matching leaf assignments.
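
The similarity (1) can be computed directly from the cross-classification matrix $M$. The following is a minimal sketch, assuming $M$ is given as a list of rows of counts; the function name and the convention of returning 0 when $P_i$ or $Q_j$ is zero (every leaf of one tree holds a single object) are choices made for this example, not part of the original definition.

```python
import math


def fowlkes_mallows_b(m):
    """Similarity B_ij of equation (1) from the H x K count matrix m."""
    n = sum(sum(row) for row in m)
    row_totals = [sum(row) for row in m]        # m_h0
    col_totals = [sum(col) for col in zip(*m)]  # m_0k
    numerator = sum(v * v for row in m for v in row) - n
    p = sum(t * t for t in row_totals) - n      # P_i
    q = sum(t * t for t in col_totals) - n      # Q_j
    if p == 0 or q == 0:
        return 0.0  # convention assumed here for the degenerate case
    return numerator / math.sqrt(p * q)


print(fowlkes_mallows_b([[3, 0], [0, 2]]))  # leaves correspond completely
print(fowlkes_mallows_b([[1, 1], [1, 1]]))  # every m_hk is 0 or 1
```

The two calls reproduce the boundary cases described above: the first matrix gives 1 and the second gives 0.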

If we want to compare two trees from the point of view of
their associated partitions and predictions, it seems natural to
modify this measure of similarity. The tree, in fact, represents a
classification rule and the comparison of two partitions derived
by different trees should take into account whether the elements
of the partitions assign the same class label to each object. Let
$c_{hk} = 1$ if the $h$-th leaf of $T_i$ has the same class label as the $k$-th
leaf of $T_j$, and $c_{hk} = 0$ otherwise. The similarity measure (1) has
been modified as follows (Miglio (1996)):

$B(T_i, T_j) = \frac{\sum_{h=1}^{H} \sum_{k=1}^{K} m_{hk}^2 c_{hk} - \sum_{h=1}^{H} \sum_{k=1}^{K} m_{hk} c_{hk}}{\sqrt{P_i Q_j}}$   (2)



where the numerator counts the pairs of objects which are classi- 
fied together in both partitions and that are also assigned to the 
same class. 
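
A sketch of the class-aware similarity (2) follows. It assumes the count matrix and the 0/1 label-agreement matrix are both given as lists of rows with the same shape; the names `class_aware_similarity`, `m` and `c` are illustrative, and returning 0 when the denominator vanishes is a convention of this example.

```python
import math


def class_aware_similarity(m, c):
    """Similarity B(T_i, T_j) of equation (2).

    m: H x K matrix of counts m_hk.
    c: H x K matrix with c_hk = 1 when leaf h of T_i and leaf k of T_j
       carry the same class label, 0 otherwise.
    """
    n = sum(sum(row) for row in m)
    p = sum(sum(row) ** 2 for row in m) - n          # P_i
    q = sum(sum(col) ** 2 for col in zip(*m)) - n    # Q_j
    if p == 0 or q == 0:
        return 0.0  # convention assumed here for the degenerate case
    # sum(m^2 c) - sum(m c), written cell by cell as c * m * (m - 1)
    numerator = sum(c_hk * m_hk * (m_hk - 1)
                    for m_row, c_row in zip(m, c)
                    for m_hk, c_hk in zip(m_row, c_row))
    return numerator / math.sqrt(p * q)


counts = [[3, 0], [0, 2]]
print(class_aware_similarity(counts, [[1, 0], [0, 1]]))  # labels agree
print(class_aware_similarity(counts, [[1, 0], [0, 0]]))  # one leaf pair disagrees
```

When all matched leaves share their labels the value coincides with (1); each disagreement removes the corresponding pairs from the numerator.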

Shannon and Banks (1999) proposed a novel method for com- 
bining a sample of classification trees using maximum likelihood 
estimation which results in a single, interpretable tree. 

Their method requires the computation of distances between 
trees; thus they proposed also a distance metric between two trees 
that measures the amount of rearrangement needed to change 
one of the trees so that it has structure identical to the other. 
Specifically, the distance between trees Tj and Tj is defined by 
Shannon and Banks as 



$d(T_i, T_j) = \sum_{r} \alpha_r |S_{ij}^{r}|$,   (3)

where $S_{ij}^{r}$ denotes the set composed of discrepant $r$-paths (paths
of length $r$ found in only one of the two trees), and $\alpha_r = \alpha(r)$ is
a weight function that depends only on $r$.

The weight function can be selected to penalize structural differences
between trees based on where these occur. For example, a
constant weight function ($\alpha_r = 1$) does not distinguish between
differences occurring early or late in the trees, while the function
$\alpha_r = 1/r$ penalizes early discrepancies more heavily than those
occurring later in the tree. A different choice of this component
makes it possible to modify the distance metric to take into account
expert opinions.
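
The distance (3) can be sketched under a simplifying assumption about how a tree is stored: here each tree is represented by the set of its paths from the root, each path being a tuple of split labels, so that a path of length $r$ is a tuple with $r$ entries. This encoding, the labels and the function name are hypothetical and only serve to illustrate the weighted count of discrepant paths.

```python
def tree_distance(paths_i, paths_j, weight=lambda r: 1.0):
    """Weighted count of discrepant paths, as in equation (3).

    paths_i, paths_j: iterables of paths, each a tuple of split labels
    (hypothetical encoding). weight: the function alpha(r).
    """
    # Paths found in only one of the two trees.
    discrepant = set(paths_i) ^ set(paths_j)
    return sum(weight(len(path)) for path in discrepant)


tree_a = {("x1",), ("x1", "x2")}
tree_b = {("x1",), ("x1", "x3")}
print(tree_distance(tree_a, tree_b))                             # constant weights
print(tree_distance(tree_a, tree_b, weight=lambda r: 1.0 / r))   # alpha_r = 1/r
```

The two trees share the root split and differ at the second level, so two paths of length 2 are discrepant: the distance is 2 with constant weights and 1 with $\alpha_r = 1/r$.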

The value of $d(T_i, T_j)$ cannot be easily interpreted on its own to
describe the intensity of the distance (for instance, to compare the
distances between different pairs of trees), since it is highly dependent
on the number of paths present in $T_i$ and $T_j$ and on the lengths of
each of them. To overcome this limitation, the value of $d(T_i, T_j)$
should be divided by a normalizing factor $\max d(T_i, T_j)$, obtained by
computing the distance between a pair of trees having exactly the
same number of paths as $(T_i, T_j)$, with the same lengths, but
constructed so as to be discrepant. This can be easily obtained
by choosing a splitting variable for the root node in $T_i$ different
from the one in $T_j$. In this way we obtain the following normalized
measure:

$D(T_i, T_j) = \frac{d(T_i, T_j)}{\max d(T_i, T_j)}$   (4)



Let us consider the characteristics of the previous measures. 
On one hand the similarity measure (2) compares the partitions 
associated to the trees taking into account their relative predic- 
tive power without taking into explicit consideration the variables 
identified as more discriminant. On the other hand, the distance 
(3) evaluates the difference between the structures of $T_i$ and $T_j$
with no regard for their associated partitions and predictions. 
When two classification trees have to be compared, both aspects 
(the structure and the predictive power) should be simultaneously 
considered. In fact, trees having the same distance with respect to 
their structures can show a very different predictive power. On the 
other hand, trees with the same predictive power can have very 
different structures. For this reason, we propose a new dissimi- 
larity measure, which considers the structure and the predictive 
power at the same time. Its definition is the following: 



$\delta(T_i, T_j) = \sum_{h=1}^{H} \alpha_{ih} (1 - s_{ih}) \frac{m_{h0}}{n} + \sum_{k=1}^{K} \alpha_{jk} (1 - s_{jk}) \frac{m_{0k}}{n}$   (5)

where $s_{ih}$ and $s_{jk}$ are similarity coefficients whose values synthesize
the similarities $s_{hk}$ between the $H$ leaves of $T_i$ and the $K$
leaves of $T_j$, computed as follows:

$s_{hk} = \frac{m_{hk} c_{hk}}{\sqrt{m_{h0} m_{0k}}}$, $h = 1, \ldots, H$, $k = 1, \ldots, K$;   (6)

$s_{hk}$ is equal to 0 when $m_{hk} = 0$ and/or $c_{hk} = 0$, and equal to 1
when $c_{hk} = 1$ and both the $h$-th leaf of $T_i$ and the $k$-th leaf of $T_j$
have exactly the same objects, which therefore are assigned to the
same class. When $T_i$ and $T_j$ are equal, each leaf of $T_i$ will be
maximally similar to a particular leaf of $T_j$, and minimally similar
to any other. For this reason, a suitable way of synthesizing the
similarities $s_{hk}$ seems to be choosing the maximum:

$s_{ih} = \max\{s_{hk}, k = 1, \ldots, K\}$,   (7)

$s_{jk} = \max\{s_{hk}, h = 1, \ldots, H\}$.   (8)
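
The leaf similarities (6) and their synthesis by maxima, (7) and (8), can be sketched as follows. The count matrix `m` and the label-agreement matrix `c` are assumed to be lists of rows of equal shape; all names are illustrative.

```python
import math


def leaf_similarities(m, c):
    """Return (s, s_i, s_j): the matrix s_hk of equation (6), the row
    maxima s_ih of (7) and the column maxima s_jk of (8)."""
    row_totals = [sum(row) for row in m]        # m_h0
    col_totals = [sum(col) for col in zip(*m)]  # m_0k
    s = []
    for h, row in enumerate(m):
        s_row = []
        for k, m_hk in enumerate(row):
            if m_hk == 0 or c[h][k] == 0:
                s_row.append(0.0)  # no shared objects or different labels
            else:
                s_row.append(m_hk / math.sqrt(row_totals[h] * col_totals[k]))
        s.append(s_row)
    s_i = [max(row) for row in s]        # best match for each leaf of T_i
    s_j = [max(col) for col in zip(*s)]  # best match for each leaf of T_j
    return s, s_i, s_j


s, s_i, s_j = leaf_similarities([[2, 2], [0, 4]], [[1, 1], [0, 1]])
print(s_i, s_j)
```

For two identical trees with agreeing labels every maximum equals 1, so all the terms $(1 - s_{ih})$ and $(1 - s_{jk})$ in (5) vanish.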

The coefficients $\alpha_{ih}$ and $\alpha_{jk}$ are dissimilarity measures: the first is
computed between the $h$-th leaf of $T_i$ and that particular leaf of $T_j$
identified by criterion (7), while the second is computed between the $k$-th
leaf of $T_j$ and that particular leaf of $T_i$ identified by criterion
(8). If the paths associated to these pairs of leaves are identical,
$\alpha_{ih}$ ($\alpha_{jk}$) is set equal to 0. Otherwise, $\alpha_{ih}$ ($\alpha_{jk}$) has to be set
greater than 0 on the basis of the values of $q$ and $p$, which denote
the length of the longest path and the level where the two paths
differ from each other, respectively. Specifically, $\alpha_{ih}$ ($\alpha_{jk}$) can be
set equal to $q - p + 1$ when it is useless to distinguish between
differences occurring early or late in the paths, or equal to $\sum_{l=p}^{q} 1/l$
when early discrepancies have to be penalized. Furthermore, the
introduction of the relative frequency of each leaf weights the
discrepancies proportionally to the number of their observations.
The maximum value of $\delta(T_i, T_j)$ is reached when the difference
between the structures of $T_i$ and $T_j$ is maximum and the
similarity between their predictive powers is zero. The normalizing
factor for $\delta(T_i, T_j)$ is thus equal to

$\max \delta(T_i, T_j) = \sum_{h=1}^{H} \alpha_{ih} \frac{m_{h0}}{n} + \sum_{k=1}^{K} \alpha_{jk} \frac{m_{0k}}{n}$   (9)

where $\alpha_{ih}$ is set equal to the length of the path from the
root node to the $h$-th leaf of $T_i$ when a set of constant weights
is considered, and to the corresponding sum of decreasing weights
otherwise; $\alpha_{jk}$ is defined in an analogous way. The normalized
version of the proposed dissimilarity is thus:

$\Delta(T_i, T_j) = \frac{\delta(T_i, T_j)}{\max \delta(T_i, T_j)}$   (10)



3 Experimental Results 

In this section we analyze the performance of the proposed mea- 
sure by studying two different classification problems. 

3.1 Threenorm example 

The first example is a simulation study based on a two-class data 
set with 6-dimensional measurement vectors. Class 1 is drawn 
with equal probability from a unit multivariate normal with mean
$(a, \ldots, a)$ and from a unit multivariate normal with mean
$(-a, \ldots, -a)$. Class 2 is drawn from a unit multivariate normal with mean
$(a, -a, a, -a)$, where $a = 2/(20)^{1/2}$. Sixty training samples of
500 observations each were generated and 10000 observations have 
been considered as test set. These trees were fit using a procedure 
like CART (Breiman et al. (1984)) implemented in GAUSS, with 
the Gini criterion for splitting and using the test set for pruning. 
Comparing pairs of the 60 trees obtained in this way, we have 
computed 1770 values for the measures (2), (4) and (10), consid- 
ering both ways of weighting discrepancies for (4) and (10). 

3.2 Breast cancer example 

The second example is a real classification problem, where the 
objective is to correctly identify benign from malignant breast 
tumors (Mangasarian and Wolberg (1990)). It is a two-class problem.
Each of the 699 instances consists of 9 cell attributes, each
of which is measured on a 1-10 scale, plus the class variable. The
Wisconsin Breast Cancer Database is in the UCI repository of machine
learning databases (ftp.ics.uci.edu/pub/machine-learning-databases)
and was obtained from the University of Wisconsin
Hospitals. We generated 50 training sets of 600 observations by
bootstrapping the original one, while the whole data set has been
considered as a test set. As in the previous example, these trees
were fit using a procedure like CART (Breiman et al. (1984))
implemented in GAUSS, with the Gini criterion for splitting and
using the test set for pruning. Comparing pairs of the 50 trees on
the test set, we have obtained 1225 values for the measures (2),
(4) and (10), considering both ways of weighting discrepancies for
(4) and (10).

3.3 Results 

Table 1 reports two Spearman correlation matrices between the
considered measures, with respect to constant and decreasing
weights, for the two examples. The comparison between measures
has been done using an ordinal correlation measure instead
of Pearson's coefficient because the empirical distributions of the
proximities come closer to the bounds for some measures than for
others; for this reason, the relative ranks of the proximities should be
compared instead of their absolute values. The values of the B and D
measures are not weakly correlated; in our opinion those
correlations would be higher if we did not take into account every
pair of trees which split the predictor space into very similar
regions but using a different rule order. At least in the examples
considered, the predictive power and the structure seem to be
two aspects of a tree which are conceptually very different but
which in practice are related to each other. For this reason we
think that, when the dissimilarity between two trees is measured,
the differences in predictive power and structure are better
evaluated by a single measure which is not influenced by pairs of
trees discrepant only in the rule orders. In both cases the correlations
between the proposed measure and the other two are greater

Table 1. Spearman correlations between the considered measures.

                     Constant weights             Decreasing weights
                     B        D        Δ          B        D        Δ
Threenorm       B    1.000    -0.391   -0.792     1.000    -0.393   -0.808
                D             1.000    0.634               1.000    0.642
                Δ                      1.000                        1.000
Breast cancer   B    1.000    -0.557   -0.732     1.000    -0.553   -0.707
                D             1.000    0.805               1.000    0.864
                Δ                      1.000                        1.000



than the correlation between B and D; this is an expected result 
that we think is due to the capability of the new measure to de- 
tect differences in the structure and in the predictive power at the 
same time. 

We have also studied how the distribution of the proposed measure
changes when the similarity (2) increases while the dissimilarity
(4) is held constant and, vice versa, when the dissimilarity (4)
increases while the similarity (2) is held constant. Figure 1 reports the
boxplots of the proposed measure distributions between the groups
corresponding to the quintiles of the measure (2) within each quintile
of (4) for the breast cancer example: with respect to the measure
(4) it is possible to determine only three groups because the
third and fourth quintiles are equal. The obtained results show
that, fixing the dissimilarity (4), the mean of the proposed measure
decreases as the similarity with respect to the predictive power
increases; a similar but opposite result follows when considering
situations of constant similarity (2): the mean
values of the proposed measure increase when the dissimilarity
(4) increases.

4 Concluding Remarks 

In this paper we explore features and limitations of two proxim- 
ity measures between classification trees. We also propose a new 
dissimilarity measure which overcomes some of the limitations 
highlighted. 

Fig. 1. Boxplots of the proposed measure distributions (constant weights).



Similarly to some measures of partition correspondence, each prox- 
imity measure between classification trees should also be corrected 
for chance (Hubert and Arabie (1985)) so as to take on some con- 
stant value under an appropriate null model. 

We are presently examining the extension of the proposed mea- 
sure to compare other types of trees (e.g.: regression trees, survival 
trees). We are also studying algorithms, based on the proposed 
measure, suitable for identifying a consensus tree which summa- 
rizes the information contained in different trees. When the trees 
are constructed on cross-validation samples, this strategy could 
represent an alternative to pruning techniques. 

Abstract. We introduce a new criterion for generating classification trees in the
case when the response variable is ordered categorical. The framework is the well
known CART methodology (Breiman et al. (1984)), which is currently the benchmark
for the case when the response is nominal or numerical. Our criterion is
obtained by measuring impurity within a node referring to a general measure of
mutual dispersion (Gini (1954)), which can be applied to every variable, whatever
its nature. We show that the two measures of impurity at the basis of CART in
the case when the response is nominal or quantitative can be obtained as particular
cases of the above mentioned measure. We also illustrate a property of our criterion,
permitting the application of an algorithm recently proposed by Mola and Siciliano
(1997, 1998) to speed up the process of growing the maximal tree.



1 Introduction 

Consider the problem of predicting a response variable, Y, on the 
basis of a set of explanatory variables. The Classification And 
Regression Trees (CART) methodology of Breiman et al. (1984) 
addresses this problem by recursively partitioning the initial set 
of N cases into subsets being more and more homogeneous (with 
respect to the response variable). This is done by referring to the 
concept of impurity (heterogeneity) within a node and of decrease
of impurity deriving from splitting a set of cases into two subsets.
In particular, this kind of strategy has been defined in CART for
the case when the response variable is nominal or quantitative.
Our aim is to provide a new impurity measure within the CART
splitting criterion in order to deal with the case when the levels
of Y can be ordered. This is accomplished by referring to a
measure of the dispersion of a variable introduced by Gini (1954).
In particular, the impurity measures at the basis of CART (in
the nominal and in the quantitative case) can be shown to be
particular cases of the mentioned measure.

2 Classification Trees: CART Methodology

Consider a set of $N$ cases, and suppose measurements are available
on $Q$ variables $X_1, X_2, \ldots, X_Q$. Suppose the $N$ cases belong
to $G$ groups induced by the levels of a response variable,
$Y$, which has to be predicted on the basis of the $Q$ explanatory
variables. Tree-structured classification addresses this problem by
constructing a prediction rule in the form of a tree obtained by
recursively partitioning the group constituted by all of the $N$
cases into subsets, called nodes, through a sequence of binary
splits. The classification tree approach of CART (Breiman et al.
(1984)) consists of two main phases. In the growing phase, nodes
are recursively split, and a sequence of nested trees is obtained,
$\{T_i\} = \{T_1, T_2, \ldots, T_m\}$, having an increasing number of terminal
nodes (i.e., nodes which are no longer split). This phase ends when
the maximal tree, $T_m$, is obtained, a tree whose terminal nodes
are either maximally Y-homogeneous (all cases are characterised
by the same level of $Y$, and further splitting is not worthwhile)
or maximally X-homogeneous (all cases are characterised by the
same levels of the explanatory variables, and further splits are
not possible) or are constituted by very few cases. In the pruning
phase terminal nodes are sequentially merged to avoid over-fitting
of the classification tree to the $N$ cases used to grow it.

The focus of this work is on the growing phase. At the basis of
this phase are the concepts of impurity (or heterogeneity) within a
node and of decrease of impurity deriving from splitting a node into
two sub-nodes. A node is split into two sub-nodes according to
the values of one of the explanatory variables. More precisely,
all the possible sub-divisions (splits) which can be defined on the
basis of the explanatory variables are evaluated, and the best split,
$s^*$, is selected by maximising the decrease in impurity. In CART two
measures of impurity are defined to deal with the case when the
response is categorical (nominal) or numerical. In the following
section we will illustrate that these two measures can be seen
as particular cases of a general measure of mutual dispersion of
a variable. Our aim is to provide a criterion for generating the
sequence $\{T_i\}$ in the case when $Y$ is a categorical ordered variable.
This will be done by suitably “adapting” to the ordinal case the
above mentioned measure of mutual dispersion.

3 Growing a Tree in the Ordinal Case 

To motivate our proposal, we first briefly discuss the
concept of heterogeneity, which plays a crucial role in the CART
methodology.

It is well known that heterogeneity can be measured in (at least)
two ways. Given a variable $Y$ assuming $G$ levels, $y_1, \ldots, y_G$,
heterogeneity within a given node, $t$, can be measured by synthesising
the (weighted) dissimilarity between each level of $Y$ and a “pole”
synthesising the distribution of $Y$:

$S(t) = \sum_{g=1}^{G} d(y_g, y_t) \pi_t(g)$   (1)

where $\pi_t(g)$ is the proportion of cases in node $t$ characterised
by the $g$-th level of $Y$. Of course, the choice of the measure of
dissimilarity $d(\cdot, \cdot)$ and of the pole $y_t$ will depend upon the nature
of the response variable.

Heterogeneity can also be evaluated by referring to the concept
of mutual dispersion (Gini (1954)), a measure taking into account
the dissimilarity of each level of $Y$ from all the others and not
from a synthesis (even if adequate) of $Y$:

$H(t) = \sum_{g=1}^{G} \sum_{j=1}^{G} d(y_g, y_j) \pi_t(g) \pi_t(j)$   (2)

Let us now focus attention on the cases when the response
variable is nominal or quantitative. In the case when $Y$ is nominal,
the dissimilarity between two levels of $Y$ can be measured by
$d_N(y_g, y_j) = 1$ if $y_g \neq y_j$, 0 otherwise, and a proper synthesis of
the distribution of $Y$ is the mode, say $y_{g^*}$. In such a situation (1)
and (2) become:

$S_N(t) = 1 - \pi_t(g^*)$ and $H_N(t) = \sum_{g=1}^{G} \pi_t(g) [1 - \pi_t(g)]$   (3)

$H_N(t)$ is the so-called Gini concentration, the measure of heterogeneity
on which the CART procedure is based (in the nominal case).

As concerns the case when $Y$ is a quantitative variable, one
can choose $d_Q(y_g, y_j) = (y_g - y_j)^2$, and a possible synthesis of the
distribution of $Y$ is the mean, $y_t$. It then follows:

$H_Q(t) = 2 \sum_{g=1}^{G} (y_g - y_t)^2 \pi_t(g) \propto S_Q(t)$   (4)

The measure of impurity on which CART is based in the nominal
and in the quantitative case is (2), properly extended according
to the nature of the response variable. Consider a node $t$, and
suppose it is partitioned into two subgroups $t_L$ and $t_R$ by split $s$.
In CART the decrease in impurity achieved when passing from $t$
to $t_L$ and $t_R$ is defined as:

$\Delta H(t, s) = H(t) - p_L H(t_L) - p_R H(t_R)$   (5)

where $p_L$ and $p_R$ denote the proportion of cases in $t_L$ and in $t_R$.
In the nominal case, (5) reduces to the well known Gini-Simpson
criterion:

$\Delta H_N(t, s) = p_L p_R \sum_{g=1}^{G} [\pi_t(g|L) - \pi_t(g|R)]^2$   (6)

$\pi_t(g|L)$ and $\pi_t(g|R)$ being the proportion of cases characterised
by the $g$-th level of $Y$ in $t_L$ and in $t_R$. In the quantitative case,
one obtains:

$\Delta H_Q(t, s) \propto \Delta S_Q(t, s) = (y_{t_L} - y_t)^2 p_L + (y_{t_R} - y_t)^2 p_R = p_L p_R (y_{t_L} - y_{t_R})^2$   (7)
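
The equivalence between the general decrease in impurity (5), computed with the nominal impurity of (3), and the closed form (6) can be checked numerically. The sketch below assumes each sub-node is described by its vector of class counts; the function names and the example counts are invented for illustration.

```python
def gini(proportions):
    """Nominal impurity H_N(t) = sum over g of pi(g) * (1 - pi(g))."""
    return sum(p * (1.0 - p) for p in proportions)


def impurity_decrease(counts_left, counts_right):
    """Return the decrease in impurity computed two ways:
    by definition (5) and by the closed form (6)."""
    n_left, n_right = sum(counts_left), sum(counts_right)
    n = n_left + n_right
    p_left, p_right = n_left / n, n_right / n
    pi_left = [c / n_left for c in counts_left]
    pi_right = [c / n_right for c in counts_right]
    pi_parent = [(a + b) / n for a, b in zip(counts_left, counts_right)]
    by_definition = (gini(pi_parent)
                     - p_left * gini(pi_left)
                     - p_right * gini(pi_right))
    closed_form = p_left * p_right * sum(
        (a - b) ** 2 for a, b in zip(pi_left, pi_right))
    return by_definition, closed_form


print(impurity_decrease([8, 2], [1, 9]))
```

Both values agree up to floating-point rounding, which reflects the fact that the decrease in impurity equals the between-node heterogeneity.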

Notice that the decrease in impurity in (6) and (7) is given
by the difference between the total heterogeneity and the within-node
heterogeneity, and thus coincides with the between-node heterogeneity.
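The equivalence between the general definition (5) and the closed form (6) can be verified numerically. The sketch below uses hypothetical class counts for the two sub-nodes; only the formulas come from the text.

```python
import numpy as np


def gini(p):
    """Gini heterogeneity sum_g p_g (1 - p_g) of a vector of proportions."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))


# Hypothetical class counts (three levels of Y) in the left and right sub-nodes.
n_left = np.array([30, 5, 5])
n_right = np.array([10, 20, 30])

n_node = n_left + n_right
p_left = n_left.sum() / n_node.sum()
p_right = 1.0 - p_left
pi_node = n_node / n_node.sum()
pi_left = n_left / n_left.sum()
pi_right = n_right / n_right.sum()

# Total heterogeneity minus within-node heterogeneity, as in (5).
decrease_5 = gini(pi_node) - p_left * gini(pi_left) - p_right * gini(pi_right)
# Anti-end-cut factor times squared distance between distributions, as in (6).
decrease_6 = p_left * p_right * float(np.sum((pi_left - pi_right) ** 2))

print(decrease_5, decrease_6)  # the two values coincide
```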

Consider now the problem of growing a classification tree when
the G levels of Y are ordered. In this case, it is natural to require
the terminal nodes to consist of cases characterised by
levels of the response variable that are as close as possible. To this
end, it is necessary to refer to a measure of impurity within a
node that takes into account the ordinality of Y. Suppose a score, $y_g$,
can be attached to each level of Y. Let then $d_{OS}(g, j) = |y_g - y_j|$
and let $\tilde{y}_t$ denote the median score within node t. It can be shown
that:

$$S_{OS}(t) = \sum_{g=1}^{G} |y_g - \tilde{y}_t|\, \pi_t(g) \qquad (8)$$

$$H_{OS}(t) = \sum_{g=1}^{G} |y_{g+1} - y_g|\, F_t(g)\,[1 - F_t(g)] \qquad (9)$$

$F_t(g)$ denoting the proportion of cases in t characterised by a level
of Y lower than or equal to the g-th (the term for $g = G$ vanishes, since
$F_t(G) = 1$). Both measures are invariant to location, so that it suffices
to define a measure of dissimilarity between two adjacent levels.
Moreover, if one does not want to attach scores to the levels of Y,
it is possible to set $y_g = g$. In this case, (9) becomes:

$$H_O(t) = \sum_{g=1}^{G} F_t(g)\,[1 - F_t(g)] \qquad (10)$$

The criterion usually referred to in the ordinal case (see Fabbris
(1997)) to evaluate a split s dividing node t into the sub-nodes
$t_L$ and $t_R$ is based upon the measure of impurity $S_{OS}(t)$, but the
decrease in impurity is evaluated as:

$$\Delta S_{OS}(t, s) = |\tilde{y}_{t_L} - \tilde{y}_t|\, p_L + |\tilde{y}_{t_R} - \tilde{y}_t|\, p_R \neq S_{OS}(t) - p_L S_{OS}(t_L) - p_R S_{OS}(t_R) \qquad (11)$$

Notice that, in a sense, (11) is obtained by “adapting” criterion (7)
to the ordinal case but, unlike (7), it cannot be viewed
as an extension of (5).

It is not the aim of this work to criticise criterion $\Delta S_{OS}$; our
main point here is that impurity is measured by combining
dissimilarities between each score and a (proper) synthesis of the
distribution, the median. From this point of view, this measure
of impurity is not a measure of mutual dispersion such as the ones
at the basis of the CART criterion (in the nominal and numerical
case). Moreover, the decrease in impurity achieved by splitting a
node into two sub-nodes is not obtained as the difference between
the total heterogeneity within the node and the heterogeneity
within the two resulting sub-nodes.

This is the reason why we suggest constructing classification
trees by measuring heterogeneity through (9), which is a proper
extension of the mutual dispersion measure to the ordinal case,
and defining a criterion to select the best splits on the basis of the
difference between total and within heterogeneity.

The proposed criterion constitutes a possible alternative
to the one usually referred to, enriching the class of instruments
available for generating classification trees in the ordinal case.

Referring for simplicity to the case when $y_g = g$ (results
similar to those that follow also hold when
information about the scores is available), we have:

$$H_O(t) = \sum_{g=1}^{G} F_t(g)\,[1 - F_t(g)] \qquad (12)$$

$$\Delta H_O(t, s) = H_O(t) - p_L H_O(t_L) - p_R H_O(t_R) = p_L\, p_R \sum_{g=1}^{G} [F_t(g|L) - F_t(g|R)]^2 \qquad (13)$$

$F_t(g|L)$ and $F_t(g|R)$ being the proportions of cases characterised
by a level of Y lower than or equal to the g-th in $t_L$ and $t_R$.
The decrease in impurity in (13) is very similar to the criteria in
(6) and (7): the so-called anti-end-cut factor ($p_L p_R$) is combined
with a measure of the dissimilarity between the two distributions
of Y in $t_L$ and $t_R$, such a measure depending upon the nature
of the response variable (notice that in the quantitative case the
difference between the distributions is measured by the distance
between the means of Y within the two sub-nodes).
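The identity in (13) can be checked with a few lines of code. This is only an illustrative sketch with scores $y_g = g$ and made-up class counts; the function name is hypothetical.

```python
import numpy as np


def ordinal_heterogeneity(counts):
    """H_O(t) = sum_g F_t(g) [1 - F_t(g)], as in (12), from class counts."""
    counts = np.asarray(counts, dtype=float)
    cdf = np.cumsum(counts) / counts.sum()
    return float(np.sum(cdf * (1.0 - cdf)))


# Hypothetical counts for four ordered levels of Y in the two sub-nodes.
n_left = np.array([20, 15, 5, 0])
n_right = np.array([2, 8, 20, 30])

n_node = n_left + n_right
p_left = n_left.sum() / n_node.sum()
p_right = 1.0 - p_left
cdf_left = np.cumsum(n_left) / n_left.sum()
cdf_right = np.cumsum(n_right) / n_right.sum()

# Total minus within heterogeneity (first expression in (13)).
decrease_direct = (ordinal_heterogeneity(n_node)
                   - p_left * ordinal_heterogeneity(n_left)
                   - p_right * ordinal_heterogeneity(n_right))
# Anti-end-cut factor times squared distance between the two
# cumulative distributions (second expression in (13)).
decrease_closed = p_left * p_right * float(np.sum((cdf_left - cdf_right) ** 2))

print(decrease_direct, decrease_closed)  # the two values coincide
```

Working with cumulative rather than class proportions is the only change with respect to the nominal case, and it is what makes the criterion sensitive to the ordering of the levels.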
To appreciate the differences between the trees generated by
referring to criteria $\Delta S_O$ and $\Delta H_O$ respectively, we now consider
an application to a real data set.

4 An Application to Real Data 

We consider a data set containing observations on 215 European
banks. The response variable is the world rank of the bank (in
1995, source Bankscope); more precisely, the response variable Y
assumes 8 levels, from highest to lowest in degree ($Y = y_g$ if the
rank is between $1000(g - 1)$ and $1000g$). For each bank,
information has been collected about its structure and performance
(country, number of employees, total assets, funding equity, net
income etc.). Figures 1 and 2 report the trees obtained with
criteria $\Delta H_O$ and $\Delta S_O$ (a darker colour corresponds to a
higher degree).




Fig. 1. Tree obtained by referring to the $\Delta H_O$ criterion

Fig. 2. Tree obtained by referring to the $\Delta S_O$ criterion

In particular, for simplicity, we considered trees with 11 and
12 terminal nodes respectively, characterised by misclassification
rates equal to 21.77% and 33.92%. As concerns the tree obtained
using $\Delta H_O$ (Figure 1), it can be noticed that the first split
separates the first 4 classes from the others. Moreover, many terminal
nodes appear strongly characterised by the presence of one class
(possibly combined with adjacent classes). Considering now the
tree obtained by applying criterion $\Delta S_O$ (Figure 2), we observe
that the terminal nodes appear more confused than in the previous
case, and non-adjacent classes are put into the same terminal node.

To better compare the obtained results it is worthwhile to 
consider the confusion tables characterising the classifiers (Tables 
1 and 2). 
Table 1. Joint distribution of (Y, Ŷ(ΔH_O))

Y \ Ŷ(ΔH_O)        1          2          3          4          5          6    Total
1                 26          0          0          1          0          0       27
2                  1    82 (60)    13 (34)          4          0      2 (3)      102
3                  2     11 (5)    61 (53)          4      1 (6)     2 (11)       81
4                  0     10 (5)    13 (12)    59 (61)      2 (4)      3 (5)       87
5                  0          0      2 (0)          0    28 (18)     0 (12)       30
6                  0      1 (0)     13 (9)      0 (6)      1 (0)    53 (53)       68
Total             29        104        102         68         32         60      395



Table 2. Joint distribution of (Y, Ŷ(ΔS_O))

Y \ Ŷ(ΔS_O)        1          2          3          4          5          6    Total
1                 21          3          0          0          2          1       27
2                  6         87          0          4          2          3      102
3                  7         36         17          7          8          6       81
4                  3         12          0         62          4          6       87
5                  0          0          0          0         18         12       30
6                  0          4          1          7          0         56       68
Total             37        142         18         80         34         84      395



In Table 1 the joint distribution of the response variable, Y,
and the predictor $\hat{Y}(\Delta H_O)$ is represented. In particular, we
consider both the predictor determined by the tree with 11 terminal
nodes (with a misclassification rate of 21.77%) and the one obtained
from the tree with 6 terminal nodes (entries in parentheses, where they
differ), whose misclassification rate, 31.392%, is more similar to the
one characterising the tree with 12 terminal nodes obtained by referring
to criterion $\Delta S_O$ (we observed that the misclassification rates
characterising the $\Delta S_O$-classifier are greater than those
characterising the $\Delta H_O$-classifier for every size of the tree).
From the above tables it seems that concordance between the response
variable and its predictor is greater when criterion $\Delta H_O$ is used.
Standard measures of concordance confirm the impression that the
concordance between Y and $\hat{Y}(\Delta H_O)$ is stronger than the
concordance between Y and $\hat{Y}(\Delta S_O)$ in both the considered
cases, 11 or 6 terminal nodes.
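The misclassification rates quoted in the text can be recomputed from the confusion tables. The sketch below transcribes the counts of Tables 1 and 2; reading the parenthesised entries of Table 1 as the tree with 6 terminal nodes is an assumption, supported by the fact that it reproduces the quoted rate.

```python
import numpy as np

# Rows: observed class Y; columns: predicted class. Counts from Tables 1 and 2.
table_h11 = np.array([[26, 0, 0, 1, 0, 0],     # Delta H_O, 11 terminal nodes
                      [1, 82, 13, 4, 0, 2],
                      [2, 11, 61, 4, 1, 2],
                      [0, 10, 13, 59, 2, 3],
                      [0, 0, 2, 0, 28, 0],
                      [0, 1, 13, 0, 1, 53]])
table_h6 = np.array([[26, 0, 0, 1, 0, 0],      # Delta H_O, 6 terminal nodes
                     [1, 60, 34, 4, 0, 3],     # (parenthesised entries, assumed)
                     [2, 5, 53, 4, 6, 11],
                     [0, 5, 12, 61, 4, 5],
                     [0, 0, 0, 0, 18, 12],
                     [0, 0, 9, 6, 0, 53]])
table_s12 = np.array([[21, 3, 0, 0, 2, 1],     # Delta S_O, 12 terminal nodes
                      [6, 87, 0, 4, 2, 3],
                      [7, 36, 17, 7, 8, 6],
                      [3, 12, 0, 62, 4, 6],
                      [0, 0, 0, 0, 18, 12],
                      [0, 4, 1, 7, 0, 56]])


def misclassification_rate(table):
    """Share of cases lying off the main diagonal of a confusion table."""
    return 1.0 - np.trace(table) / table.sum()


rates = {name: 100.0 * misclassification_rate(tab)
         for name, tab in [("H11", table_h11), ("H6", table_h6), ("S12", table_s12)]}
print(rates)
```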

Nevertheless, in this case we are mainly interested in emphasising
the structural differences between the two approaches rather
than in demonstrating the better performance of one of them. An
interesting consideration concerns the fact that the $\Delta S_O$-classifier
appears to suffer from the absence of the anti-end-cut factor, a
factor discouraging the creation of one (almost pure) sub-node
containing a small number of individuals and one containing all the
other individuals (see Taylor and Silverman (1993) for a detailed
discussion). The above considerations encourage the use of our
criterion as a possible valid alternative to the criterion usually used
to build classification trees in the ordinal case, $\Delta S_O$.

Needless to say, the proposed $\Delta H_O$ may suffer from many
of the drawbacks which have already been pointed out for its
nominal counterpart (see, among others, Taylor and Silverman
(1993) and Shih (1999)). Nevertheless, we think it can constitute
a good “starting point”, which has to be subsequently modified by
taking into account, and adapting to the ordinal case, the proposals
developed to improve the Gini-Simpson criterion.

5 A Property of the New Criterion 

In this section we illustrate an interesting property of $\Delta H_O$
(which also characterises criterion $\Delta H_{OS}$). In particular, we will
show that $\Delta H_O$ can be related to an index measuring the proportional
reduction of heterogeneity of the response Y due to a predictor.
This result is not only interesting from an interpretative point of
view, but also in that it allows the FAST algorithm
introduced by Mola and Siciliano (1997, 1998) to be applied to speed up
the process of growing the maximal tree. In particular, let X be a generic
predictor of Y, assuming I levels; consider a node t, and suppose the
cases in t are divided, according to the levels of X, into I
subgroups. If $H_O(t)$ is the heterogeneity of Y at node t and $H_O(i|t)$ is
the heterogeneity of Y in the i-th subgroup, having a proportion
$p_t(i)$ of cases, the proportional reduction in heterogeneity of Y
due to predictor X is:

$$\tau_{O(t)}(Y|X) = \frac{H_O(t) - \sum_{i=1}^{I} p_t(i)\, H_O(i|t)}{H_O(t)} \qquad (14)$$



This index (Piccarreta, 2001) measures the strength of
association between the (nominal) predictor X and the (ordinal)
response variable Y. Consider now a split $s_X$, defined on the basis
of X, generating the two subgroups $t_L$ and $t_R$. It can be shown
that:

$$\tau_{O(t)}(Y|s) = \frac{H_O(t) - p_L H_O(t_L) - p_R H_O(t_R)}{H_O(t)} = \frac{\Delta H_O(t, s)}{H_O(t)} \qquad (15)$$



Hence, the split maximising $\Delta H_O(t, s)$ is the split maximising
$\tau_{O(t)}(Y|s)$. This result emphasises that our criterion, like the
Gini-Simpson one (see Mola and Siciliano (1997)), searches for
the splitting variable predicting the response variable in the best
possible way (according to the predictability index $\tau_O$). Moreover,
it can be easily shown that $\tau_{O(t)}(Y|X) \geq \tau_{O(t)}(Y|s_X)$ for each
split $s_X$ based upon the levels of X. Following Mola and Siciliano
(1997), this property can be used to find the best split of a node
without trying out all possible splits.

Consider in fact two explanatory variables, say X and Z, and
suppose that $\tau_{O(t)}(Y|X) \geq \tau_{O(t)}(Y|Z)$. Let $s^*_X$ and $s^*_Z$ be the best
splits of node t which can be defined on the basis of X
and Z respectively. If $\tau_{O(t)}(Y|s^*_X) \geq \tau_{O(t)}(Y|Z)$, then necessarily
$\tau_{O(t)}(Y|s^*_X) \geq \tau_{O(t)}(Y|s^*_Z)$, so that it is not necessary to evaluate
the predictive power of all the splits which can be defined on
the basis of Z. The computational gain in terms of the reduced
number of splits inspected and in terms of CPU time is illustrated
in Mola and Siciliano (1997, 1998).
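The bounding argument can be illustrated with a small sketch. It is not the FAST implementation of Mola and Siciliano, only a brute-force check of the inequality on which the shortcut relies; the contingency tables and function names are hypothetical, and every binary split is simply enumerated (each one twice, once per side).

```python
from itertools import combinations

import numpy as np


def h_ordinal(counts):
    """H_O = sum_g F(g) [1 - F(g)] for a vector of class counts (scores y_g = g)."""
    counts = np.asarray(counts, dtype=float)
    cdf = np.cumsum(counts) / counts.sum()
    return float(np.sum(cdf * (1.0 - cdf)))


def tau(table):
    """Proportional reduction in H_O of Y due to the grouping given by the rows.

    table[i, g] = number of cases in the node with group i and level g of Y.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    total = h_ordinal(table.sum(axis=0))
    within = sum(row.sum() / n * h_ordinal(row) for row in table)
    return (total - within) / total


def best_binary_split_tau(table):
    """Largest tau over all binary splits obtained by merging predictor levels."""
    table = np.asarray(table, dtype=float)
    levels = list(range(table.shape[0]))
    best = 0.0
    for size in range(1, len(levels)):
        for left in combinations(levels, size):
            right = [i for i in levels if i not in left]
            merged = np.vstack([table[list(left)].sum(axis=0),
                                table[right].sum(axis=0)])
            best = max(best, tau(merged))
    return best


# Hypothetical cross-classifications of the same 120 cases by X (3 levels)
# and by Z (2 levels) against three ordered levels of Y.
table_x = np.array([[30, 10, 5], [5, 20, 10], [2, 8, 30]])
table_z = np.array([[20, 18, 22], [17, 20, 23]])

tau_x, tau_z = tau(table_x), tau(table_z)
best_x = best_binary_split_tau(table_x)
skip_z = best_x >= tau_z   # if True, no split on Z needs to be evaluated
print(tau_x, best_x, tau_z, skip_z)
```

In this made-up example the best split on X already exceeds the global index of Z, so the splits based on Z can be discarded without being inspected.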



6 Conclusions 

In this work we introduce a new criterion for generating
classification trees in the case when the levels of the categorical response
variable, Y, can be ordered. This is done by referring to a general
measure of the mutual dispersion of a variable. Such a measure,
introduced by Gini (1954), can be applied to every variable,
whatever its nature, by suitably defining a measure of the dissimilarity
between a generic pair of values (or categories) assumed by the
variable itself. We illustrate that two “natural” choices for the
dissimilarity between the values of nominal and quantitative
variables lead to the measures for evaluating a split which are at the
basis of CART, one of the most popular procedures to build
classification trees. We thus define an ordinal criterion by referring to
a suitable measure of dissimilarity between the levels of an ordinal
categorical variable. The new criterion is compared with the one
usually referred to in the ordinal case, both from a theoretical
point of view and with respect to their performance on a
real data set. The performance of the new criterion is encouraging,
suggesting that it can be regarded as a valid alternative
to the one usually employed.

The main focus of the analysis is the growing phase of the CART
procedure, whose final output is the maximal tree, $T_M$. With respect
to this point, we show that our criterion is related to an index
measuring the proportional reduction of the heterogeneity of the
response variable in a node when splitting the node itself into two
sub-nodes. This finding is valuable from an interpretative
point of view; moreover, it allows an algorithm introduced by
Mola and Siciliano (1997, 1998) to speed up the procedure of growing
$T_M$ to be extended to the ordinal case.

Of course, the proposed criterion may share
some of the drawbacks already pointed out, especially for its nominal
counterpart. Our aim is to provide a starting point, which should
subsequently be modified by taking into account, and adapting to the
ordinal case, the proposals developed to improve the Gini-Simpson
criterion.
Abstract. The paper proposes an adjusted maximum likelihood estimator for the
parametric estimate of a $STARG(p, \lambda_0, \ldots, \lambda_p)$ model with measurement noise.
Provided the noise variance is known or can be estimated consistently, the adjusted
maximum likelihood estimator is proved to be asymptotically equivalent to the
corresponding exact maximum likelihood estimator that, in this study, turns out
to be computationally intractable. The theoretical background outlined in the paper
finds a natural field of application in observed image sequences. Thus, we present
the results of a state-space smoothing procedure performed on monthly observations
over a regular lattice.



1 Introduction 

The paper is concerned with parameter estimation and smoothing
of a spatio-temporal series corrupted by noise. In particular, we
assume that $X(s,t)$ is a spatio-temporal process observed over
time t and general location s within the geographical domain of
interest D. That is, we consider $X(s,t)$ at $(s_i, t)$ for $i = 1, 2, \ldots, n$
and $t = 1, 2, \ldots, T$. In what follows, we use the model

$$X(s,t) = Y(s,t) + \varepsilon(s,t) \qquad (1)$$

where $Y(s,t)$ is a zero mean, $L_2$-continuous (i.e.
$E\{Y(s+h,t) - Y(s,t)\}^2 \to 0$ as $h \to 0$; Cressie 1993, p. 127), second
order stationary Gaussian process; $\varepsilon(s,t)$ is a white noise
measurement process, independent of $Y(s,t)$, with variance $\sigma_\varepsilon^2$. The
variable $Y(s,t)$ represents the variable of interest and constitutes
the state or the signal process. The paper is organised as
follows: in Section 2 the state process is assumed to be represented
by a $STARG(p, \lambda_0, \ldots, \lambda_p)$ (Space Time Autoregressive
Generalised) model (Di Giacinto 1994; Terzi 1995) and expression (1)
will be referred to as a STARG+Noise model. Given the assumptions
for the state and measurement error processes, Section 3 deals
with the estimation of the model parameters. Under normality
assumptions for the signal and noise processes, maximum likelihood
estimators are derived. However, the likelihood function turns out
to be computationally intractable even for samples of moderate
dimension. To overcome this problem, following Dryden et
al. (2002), we propose an adjusted likelihood function. The
adjusted maximum likelihood estimator (AMLE-ST) is proved to be
asymptotically equivalent to the maximum likelihood (ML)
estimator, provided the noise variance is known or can be estimated
consistently. Apart from its computational feasibility, AMLE-ST
has the appeal of not requiring the specification of the entire
distribution function of the noise process, but only of its first
two moments. In Section 3.1 a simulation study is carried out to
assess the performance of the estimator. In this case we have
selected some example situations to give a flavour of the types of
behaviour that our estimator exhibits under particular conditions.
Section 4 shows an application to image sequence analysis. The
data set is provided by the Goddard Distributed Archive Center
(GDAC), and consists of Earth images derived through the
Normalized Difference Vegetation Index (NDVI) to study the vegetation
dynamics around the globe. Finally, Section 5 closes the paper
with some final considerations on AMLE-ST.

2 The STARG+Noise Model 

Let us assume that the unobserved state process $Y(s,t)$, apart
from a deterministic mean component $\mu(s,t)$, evolves in a Markov
manner so that only recent observation values influence the
current time value. The large scale variation, measured by the mean
component $\mu(s,t)$, is assumed to be represented by a parametric
spatial trend $f(s; \beta_t)$ of general functional form and possibly time
varying parameters $\beta_t$. Under this assumption, using vector
notation, we model the spatio-temporal dependence by means of the
$STARG(p, \lambda_0, \ldots, \lambda_p)$ (Space Time Autoregressive Generalised)
model (Di Giacinto 1994; Terzi 1995)

$$Y(\cdot,t) = \sum_{h=0}^{p} \sum_{k=0}^{\lambda_h} \phi_{hk}\, W^{(k)}\, Y(\cdot,t-h) + u(\cdot,t) \qquad (2)$$

where $Y(\cdot,t)$ and $u(\cdot,t)$ are $(n \times 1)$ vectors corresponding to the n
sites, p is the temporal order of the model, $\lambda_h$ is the spatial order
of the h-th autoregressive component, $\phi_{hk}$ is the autoregressive
parameter at temporal lag h and spatial lag k ($\phi_{00} = 0$), $W^{(k)}$ is
the $(n \times n)$ spatial contiguity matrix of order k (Pfeifer and Deutsch, 1980)
and $u(\cdot,t)$ is a homoskedastic white-noise term with variance $\sigma_u^2 I$.
Note that if isotropy is not reasonable, an anisotropic process can
be used by allowing the coefficients $\phi_{hk}$ to vary also with direction
and orientation. Setting



$$A_0 = I - \sum_{k=0}^{\lambda_0} \phi_{0k}\, W^{(k)}, \qquad A_h = \sum_{k=0}^{\lambda_h} \phi_{hk}\, W^{(k)}, \quad \text{for } h = 1, 2, \ldots, p$$

we rewrite the STARG model in its vector autoregressive (VAR)
form

$$A_0 Y(\cdot,t) - A_1 Y(\cdot,t-1) - \ldots - A_p Y(\cdot,t-p) = u(\cdot,t) \qquad (3)$$

to obtain the state-space representation of the STARG+Noise model
(Ippoliti et al., 1998). In fact, model (3) can also be represented
in a more compact notation through the following expression

$$\xi(\cdot,t) = \Phi\, \xi(\cdot,t-1) + U(\cdot,t) \qquad (4)$$

where $\xi(\cdot,t)$ is the state vector, $\Phi$ is the transition matrix and
$U(\cdot,t)$ is the model noise; their structure is as follows
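$$\begin{bmatrix} Y(\cdot,t) \\ \vdots \\ Y(\cdot,t-p+1) \end{bmatrix} = \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I & 0 \end{bmatrix} \begin{bmatrix} Y(\cdot,t-1) \\ \vdots \\ Y(\cdot,t-p) \end{bmatrix} + U(\cdot,t), \qquad U(\cdot,t) = \begin{bmatrix} u(\cdot,t) \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$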

Equation (4) is the state equation of the state-space form of
model (3), while the corresponding measurement equation is given
by

$$X(\cdot,t) = H\, \xi(\cdot,t) + \varepsilon(\cdot,t) \qquad (5)$$

where $H = [A_0^{-1}\ 0\ \cdots\ 0]$ is the measurement matrix.

Given the state space formulation, it is straightforward to
apply the Kalman filter which, computing the optimal recursive
estimate of the state vector at time t, leads to the implementation
of algorithms for filtering and smoothing.
3 Maximum Likelihood Estimation

The Kalman filter can be run only when all the parameters
of the state space model are known. To this end, we present
here a maximum likelihood estimator of such parameters. If the
initial values of the process are equal to their unconditional mean,
the STARG model can also be written as follows

$$\Pi(\mathbf{Y} - \boldsymbol{\mu}) = \mathbf{u} \qquad (6)$$

with $\Pi = (I_T \otimes A_0) - \sum_{h=1}^{p} (B_h \otimes A_h)$ (where the $T \times T$ matrix
$B_h$ has all zero entries except for the h-th lower diagonal, with
elements equal to one), $\mathbf{Y} = vec(Y)$, $Y = \{Y(\cdot,t),\ t = 1, \ldots, T\}$,
$\mathbf{u} = vec(u)$, $u = \{u(\cdot,t),\ t = 1, \ldots, T\}$. Assuming a polynomial
approximation for the functional form $f(s; \beta_t)$ of the deterministic
spatial trend, we can set

$$\mu(\cdot,t) = D\beta_t \qquad (7)$$

where $\beta_t$ is an r-dimensional vector of coefficients and D is a
suitable $n \times r$ design matrix. As a consequence we have

$$\boldsymbol{\mu} = \tilde{D}\boldsymbol{\beta} \qquad (8)$$

where $\tilde{D} = (I_T \otimes D)$ and $\boldsymbol{\beta} = vec(\beta)$, $\beta = \{\beta_t,\ t = 1, \ldots, T\}$.
When the trend coefficients are fixed over time, i.e. when
$\beta_1 = \beta_2 = \ldots = \beta_T = \beta$, expression (8) simplifies to
$\boldsymbol{\mu} = \tilde{D}\beta$, where now $\tilde{D} = (i_T \otimes D)$, with $i_T$ a $T \times 1$ vector of ones.
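The Kronecker structure of $\Pi$ in (6) can be made concrete with a short sketch for a model with p = 1 and spatial orders equal to one. The contiguity matrix `w` (a line of four sites, row-standardised) and the convention $W^{(0)} = I$ are assumptions made for the illustration; the coefficient values are those used later in the simulation study.

```python
import numpy as np


def build_pi(a0, a_list, n_times):
    """Pi = (I_T kron A_0) - sum_h (B_h kron A_h), with B_h having ones on
    the h-th lower diagonal and zeros elsewhere."""
    pi = np.kron(np.eye(n_times), a0)
    for h, a_h in enumerate(a_list, start=1):
        pi -= np.kron(np.eye(n_times, k=-h), a_h)
    return pi


n_sites, n_times = 4, 6

# Hypothetical first-order contiguity matrix: four sites on a line,
# neighbours are adjacent sites, rows standardised to sum to one.
w = np.zeros((n_sites, n_sites))
for i in range(n_sites - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0
w /= w.sum(axis=1, keepdims=True)

phi01, phi10, phi11 = 0.5, -0.4, 0.2
a0 = np.eye(n_sites) - phi01 * w              # A_0 = I - phi_01 W
a1 = phi10 * np.eye(n_sites) + phi11 * w      # A_1 = phi_10 I + phi_11 W

rng = np.random.default_rng(0)
y = rng.normal(size=(n_times, n_sites))       # row t holds Y(., t) - mu(., t)

# Innovations from the stacked form (6): time blocks are stacked one after
# the other, so the row-major flattening of y gives vec(Y).
u_stacked = build_pi(a0, [a1], n_times) @ y.reshape(-1)

# The same innovations from the VAR form (3), with zero pre-sample values.
u_loop = np.array([a0 @ y[t] - (a1 @ y[t - 1] if t > 0 else 0.0)
                   for t in range(n_times)])

print(np.allclose(u_stacked, u_loop.reshape(-1)))
```

Because $\Pi$ is block lower triangular with $A_0$ on the diagonal, its log-determinant is $T \log|A_0|$, which is the term appearing in the log-likelihood.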

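In this case, the Gaussian log-likelihood function of the STARG
process has a remarkably simple expression and can be directly
obtained as

$$L_Y(\mathbf{Y}; \theta | Y_0) = -\frac{nT}{2} \log(\sigma_u^2) + T \log|A_0| - \frac{(\mathbf{Y} - \boldsymbol{\mu})^\top \Pi^\top \Pi (\mathbf{Y} - \boldsymbol{\mu})}{2\sigma_u^2} \qquad (9)$$

where $Y_0 = [Y(\cdot,0), Y(\cdot,-1), \ldots, Y(\cdot,-p+1)]$ and $\theta$ is the
vector of all the parameters to be estimated. However, since the state
process is not observable, using (1) and following Dryden et al.
(2002), it is possible to work with the log-likelihood of the signal
by means of the observed series $X(\cdot,t)$

$$\begin{aligned}
L_Y(\mathbf{X}; \theta | X_0) &= -\frac{nT}{2} \log(\sigma_u^2) + T \log|A_0| - \frac{(\mathbf{X} - \mathbf{E} - \boldsymbol{\mu})^\top \Pi^\top \Pi (\mathbf{X} - \mathbf{E} - \boldsymbol{\mu})}{2\sigma_u^2} \\
&= -\frac{nT}{2} \log(\sigma_u^2) + T \log|A_0| - \frac{1}{2\sigma_u^2} (\mathbf{X} - \boldsymbol{\mu})^\top \Pi^\top \Pi (\mathbf{X} - \boldsymbol{\mu}) \\
&\quad + \frac{1}{\sigma_u^2} \mathbf{E}^\top \Pi^\top \Pi (\mathbf{X} - \boldsymbol{\mu}) - \frac{1}{2\sigma_u^2} (\mathbf{E}^\top \Pi^\top \Pi \mathbf{E}) \qquad (10)
\end{aligned}$$

where $\mathbf{X} = vec(X)$, $X = \{X(\cdot,t),\ t = 1, \ldots, T\}$, $\mathbf{E} = vec(\varepsilon)$,
$\varepsilon = \{\varepsilon(\cdot,t),\ t = 1, \ldots, T\}$. Since
$E[\mathbf{E}^\top \Pi^\top \Pi (\mathbf{X} - \boldsymbol{\mu})] = \sigma_\varepsilon^2\, tr(\Pi^\top \Pi)$,
we obtain the following adjusted log-likelihood, whose maximiser is the
Adjusted Maximum Likelihood Estimator (AMLE-ST)

$$L_Y(\mathbf{X}; \theta | X_0) = -\frac{nT}{2} \log(\sigma_u^2) + T \log|A_0| - \frac{[\tilde{\mathbf{X}}^\top \Pi^\top \Pi \tilde{\mathbf{X}} - 2\sigma_\varepsilon^2\, tr(\Pi^\top \Pi)]}{2\sigma_u^2} \qquad (11)$$

where $\tilde{\mathbf{X}} = (\mathbf{X} - \boldsymbol{\mu})$ and tr is the trace operator. Note that the
estimator is weakly consistent and asymptotically normal (see
Dryden et al., 2002). From the first order conditions, a closed form
solution can be obtained for the trend coefficients and the innovation
variance $\sigma_u^2$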



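$$\hat{\beta} = (\tilde{D}^\top \Pi^\top \Pi \tilde{D})^{-1} \tilde{D}^\top \Pi^\top \Pi \mathbf{X} \qquad (12)$$

$$\hat{\sigma}_u^2 = \frac{1}{nT} \left[ (\mathbf{X} - \hat{\boldsymbol{\mu}})^\top \Pi^\top \Pi (\mathbf{X} - \hat{\boldsymbol{\mu}}) - 2\sigma_\varepsilon^2\, tr(\Pi^\top \Pi) \right] \qquad (13)$$

and to facilitate the iterative search needed to compute the ML
estimates of the autoregressive coefficients, a concentrated
log-likelihood function can be derived by substituting (13) in (11)

$$L_Y(\mathbf{X}; \cdot | \cdot) = T \log|A_0| - \frac{nT}{2} \log \left[ (\mathbf{X} - \hat{\boldsymbol{\mu}})^\top \Pi^\top \Pi (\mathbf{X} - \hat{\boldsymbol{\mu}}) - 2\sigma_\varepsilon^2\, tr(\Pi^\top \Pi) \right] \qquad (14)$$

where $\hat{\boldsymbol{\mu}} = \tilde{D}\hat{\beta}$. Since $\hat{\boldsymbol{\mu}}$ is a non-linear function of the
autoregressive coefficients $\phi$, maximisation of (14) is not an easy task.
However, it can be greatly facilitated by the adoption of the
following stepwise optimisation procedure: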

1. a preliminary estimate of $\beta$ is computed by setting $\Pi = I_{nT}$ in
(12), i.e. taking the OLS estimates of the regression of $\mathbf{X}$ on
$\tilde{D}$; given the deterministic nature of $\tilde{D}$, the OLS estimators
are consistent and thus provide a valid and easily computed
starting point;

2. $\hat{\boldsymbol{\mu}}_0 = \tilde{D}\hat{\beta}_0$ is substituted for $\hat{\boldsymbol{\mu}}$ in (14), which is then maximised
through an iterative search algorithm (the Newton-Raphson
iterative procedure has been successfully implemented for this
purpose) to yield the estimates $\hat{\phi}_0$ of $\phi$;

3. based on the estimated $\hat{\phi}_0$, the matrix $\hat{\Pi}_0$ is derived and the
GLS estimates $\hat{\beta}$ are computed from (12);

4. steps 2 and 3 are iterated until convergence is achieved.

If the measurement noise variance $\sigma_\varepsilon^2$ is assumed to be known
a priori, equations (12)-(14) can be directly used to estimate
the remaining parameters. On the other hand, if $\sigma_\varepsilon^2$ is not known,
a consistent estimator or a measurement nugget effect (Cressie,
1993) can be used.

3.1 Simulation study 

In order to verify the performance of the estimator we have
selected some example situations. In each simulation 100 samples
were generated from a STARG(1,11)+Noise model with
parameters $\phi_{01} = 0.5$; $\phi_{10} = -0.4$; $\phi_{11} = 0.2$ and $\sigma_u^2 = 1$. To study
the behaviour of our estimator we performed different simulations
varying both n and T. For all the simulations, the noise variance
parameter $\sigma_\varepsilon^2$ has been fixed to produce a Signal-to-Noise Ratio
(SNR, i.e. the ratio of the signal variance to the noise variance)
equal to 5 dB. Even though several methods can be used to
estimate this parameter, in each simulation the measurement noise
variance was set equal to its true value. The results of this
simulation study are shown in Tables 1-2, where the standard errors
are reported in brackets.
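To make the 5 dB setting concrete, the sketch below converts an SNR expressed in decibels into the corresponding noise variance. It assumes the usual power-ratio convention, $SNR_{dB} = 10 \log_{10}(\sigma^2_{signal} / \sigma^2_{noise})$, which the text does not spell out, and the signal variance used in the example is a made-up value.

```python
def noise_variance_for_snr(signal_variance, snr_db):
    """Noise variance giving the requested SNR in dB, assuming
    SNR_dB = 10 * log10(signal_variance / noise_variance)."""
    return signal_variance / (10.0 ** (snr_db / 10.0))


# Under this convention 5 dB corresponds to a variance ratio of about 3.16.
ratio_5db = 10.0 ** (5.0 / 10.0)

# Hypothetical signal variance of 2.0 (not a value from the paper).
noise_var = noise_variance_for_snr(signal_variance=2.0, snr_db=5.0)
print(ratio_5db, noise_var)
```

Note that the signal variance here is the variance of the state process Y, which depends on the autoregressive coefficients as well as on $\sigma_u^2$.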
Table 1. The means (and standard errors) of the parameter estimates from 100
simulations of a STARG(1,11)+Noise model. The true parameters are φ_01 = 0.5;
φ_10 = -0.4; φ_11 = 0.2 and σ²_u = 1.

                            T = 30                                   T = 60
Lattice size    φ̂_01     φ̂_10     φ̂_11     σ̂²_u        φ̂_01     φ̂_10     φ̂_11     σ̂²_u
n = 16x16       0.504    -0.404    0.208    0.981       0.504    -0.401    0.204    0.984
               (0.012)  (0.011)  (0.019)  (0.028)     (0.008)  (0.008)  (0.013)  (0.016)
n = 32x32       0.503    -0.401    0.205    0.984       0.503    -0.402    0.205    0.986
               (0.006)  (0.005)  (0.010)  (0.010)     (0.004)  (0.004)  (0.007)  (0.008)
n = 64x64       0.505    -0.403    0.206    0.984       0.504    -0.401    0.203    0.982
               (0.003)  (0.003)  (0.005)  (0.006)     (0.002)  (0.002)  (0.003)  (0.004)

Table 2. The means (and standard errors) of the parameter estimates from 100
simulations of a STARG(1,11)+Noise model. The true parameters are φ_01 = 0.5;
φ_10 = -0.4; φ_11 = 0.2 and σ²_u = 1.

                            T = 100
Lattice size    φ̂_01     φ̂_10     φ̂_11     σ̂²_u
n = 16x16       0.504    -0.402    0.205    0.983
               (0.006)  (0.006)  (0.010)  (0.012)
n = 32x32       0.504    -0.401    0.202    0.986
               (0.003)  (0.002)  (0.005)  (0.006)
n = 64x64       0.504    -0.401    0.203    0.981
               (0.002)  (0.001)  (0.003)  (0.003)


Fig. 1. NDVI data set 
As can be seen from the tables, the fundamental properties
of maximum likelihood estimators are respected. As expected
for a weakly consistent estimator, the performance of AMLE-ST
improves as both the number of time points and the lattice size increase.
Similar findings were obtained in a variety of other simulations with
other parameters and spatial neighbourhood structures.
4 Smoothing of Noisy Image Sequences

The theoretical background outlined in the previous sections finds a
natural field of application in observed image sequences. Here
we present the results of a state-space smoothing procedure
performed on monthly observations over a regular lattice.

The data set is provided by the Goddard Distributed Archive
Center (GDAC), and consists of Earth images derived through the
Normalized Difference Vegetation Index (NDVI) to study the
vegetation dynamics around the globe. For a more detailed
description of the data see Los et al. (1994). We considered a period of 11
years (January 1982 - December 1992) for a total of 132 temporal
observations, each consisting of a 64 x 64 image from the
Middle-East of North America. Figure 1 is a spatio-temporal grey scale
image plot of the whole data set where the N x M = n sites have
been arranged in lexicographic order (i.e. in a raster scan
manner). The values range from 0 (for dense vegetation zones) to 255
(for land zones) and the vertical stripes confirm the presence of
a temporal seasonality highlighted by the temporal data analysis
of the single time series.

To create a zero mean data set we computed a time varying
three-parameter plane surface, which also allowed the temporal
seasonality to be removed. Furthermore, although the data had
already been treated by the GDAC, measurement noise can still
be observed. The estimation of the measurement noise variance $\sigma_\varepsilon^2$
can be carried out through several methods. Huang and Cressie
(2000) estimate $\sigma_\varepsilon^2$ as a nugget effect in a wavelet decomposition,
while Gauch (1982) uses eigenvector ordination techniques for
field data. Olsen (1993) provides a wide range of methods on this
topic. With respect to a single image, reasonable values for $\sigma_\varepsilon^2$ can
also be obtained by computing the variance of a flat portion of the
image. In this study, following Lim (1990), we employed a pixel-wise
adaptive Wiener filter based on a 5 x 5 local neighbourhood
of each pixel, obtaining an estimate $\hat{\sigma}_\varepsilon^2 = 76.91$. This procedure can
be easily performed by means of the wiener2.m function of the Matlab
Image Toolbox. Note that we obtained a very similar result by
taking $\sigma_\varepsilon^2$ as the nugget effect of a spherical covariogram (Cressie,
1993). Given the measurement noise variance, the state-space
formulation, for different orders of the STARG models, was then used
to fit the data and calculate the AIC statistics (Akaike, 1974) to
choose the best model. Thus, using this information criterion, the
STARG(1,11) model with estimated autoregressive coefficients
0.809 (0.001), 0.254 (0.001) and -0.157 (0.002), and
$\hat{\sigma}_u^2 = 112.51$ (0.005) (standard errors in brackets) was chosen as
the best single model and as the basis for inference. To give a flavour
of the quality of the smoothing procedure, Figures 2(a) and 2(b) present,
respectively, the original and filtered data, while Figure 3 shows an
estimate of the observational error for the year 1990. As a check on these
results, it has also been found that the smoothing residuals show a
covariance pattern represented by a nugget variogram and do not
exhibit any spatial structure.

Fig. 3. NDVI estimated measurement error for year 1990

5 Conclusions 

Provided a good initial estimate of the noise variance can be given,
the adjusted maximum likelihood estimator appears to give
excellent results in a variety of situations for noisy STARG models. A
point to highlight is that AMLE-ST depends only on the first two
moments of the noise distribution, and so would be appropriate for
general noise distributions (Dryden et al., 2002). The procedure
relies on large sample sizes for the consistency arguments in the
adjustments to be valid. However, for a sequence of lattice data
(such as those considered in Section 4) we usually do have a huge
data set. Of course, a deeper (theoretical and simulation) study
is needed to fully establish the statistical properties of AMLE-ST.
To this end, this estimator should also be compared with
the corresponding exact maximum likelihood estimator
(available only for Gaussian measurement noise). All these
points, together with the study of the effect of different boundary
conditions, will be considered in future work.
Abstract. In this paper the orthogonal decomposition is used in order to
reconstruct the noiseless component of a temporal stochastic process. For weakly
stationary processes, the proposed methodology is based on the joint application of
the spectral analysis in the frequency domain (Fourier analysis) and in the time
domain (Karhunen Loeve expansion). For non stationary processes the orthogonal
decomposition is realized in the wavelet domain.



1 Time Series Decomposition and Stochastic 
Process Expansion 

The purpose of this paper is to discuss issues associated with
the problem of signal extraction. Using the general orthogonal
decomposition of a stochastic process, the objective is to analyse
the structure of a signal from a set of observations which are
recorded with error. In particular, assuming that $\{X(t) : t \in [0, T]\}$ is a
stochastic process with continuous parameter in the interval $[0, T]$,
the measurement equation is given as the sum of two independent
components

$$X(t) = L(t) + A(t) \qquad (1)$$

where $L(t)$ is the signal and $A(t)$ the noise.

If $X(t)$ is a finite energy process, it is possible to expand it in a set
of deterministic functions $\phi_n$ which form a complete
orthogonal system in the space of square integrable functions $L_2(T)$

$$X(t) = \sum_{n} a_n \phi_n(t) \qquad (2)$$

where the stochastic coefficients $a_n$ are given by the following
inner product

$$a_n = \int_T X(t)\, \phi_n(t)\, dt \qquad (3)$$

The paper is organized as follows. In Section 2 second order 
stationary processes are considered and equations (2) and (3) are 
specified in frequency and time domains. As explained in Section 
2.1, we propose a joint application of spectral analysis in these 
two domains in order to reconstruct the noiseless component L(t) 
of equation (1). For covariance stationary processes, the proposed 
methodology also leads to the detection of the harmonic compo- 
nents of the signal. To this purpose, an example, based on the 
identification of the structural components of the monthly time 
series of energy consumption in Italy, is presented in Section 2.2. 
In Section 3, in the context of wavelet analysis, we consider the
denoising problem for covariance non-stationary processes. The
methodology is illustrated in Section 3.1, while Section 4 concludes
the paper with an application to the daily time series of sulphur
dioxide concentration in the Leeds District.

2 Second Order Stationary Processes: Fourier 
and Karhunen Loeve Expansion 

Fourier decomposition, in the frequency domain, and Karhunen
Loeve expansion, in the time domain, are special cases of
equation (2). In Fourier analysis the basis functions depend on
frequency, while in the Karhunen Loeve approach they depend on
time.

In the Fourier representation the process X(t) is expressed as
a special form of a Fourier integral

$$X(t) = (\sqrt{2\pi})^{-1} \int_{-\infty}^{\infty} G(w)\, e^{itw}\, dw, \qquad t \in [-T, T] \qquad (4)$$

In (4) the coefficients G(w) represent the weights of each basis
function $e^{itw}$ in the process reconstruction, and are obtained by
the Fourier transform of the process X(t)

$$G(w) = (\sqrt{2\pi})^{-1} \int_{-\infty}^{\infty} X(t)\, e^{-itw}\, dt \qquad (5)$$

for which

$$\lim_{T \to \infty} E\left[ |G(w)|^2 / 2T \right] = h(w) \qquad (6)$$

In (6) h(w) is the non-normalized spectral density function
of X(t), obtained as the Fourier transform of the covariance function
C(t,s)

$$(\sqrt{2\pi})^{-1} \int C(t,s)\, e^{itw}\, dt = h(w) \qquad (7)$$

On the other hand, in the time domain the Karhunen Loeve
expansion allows us to represent a zero mean random process X(t)
as a series of orthogonal functions with uncorrelated coefficients

$$X(t) = \sum_{j=1}^{\infty} Z_j\, \theta_j(t), \qquad t \in [0, T] \qquad (8)$$

so that

$$Z_j = \int_0^T X(t)\, \theta_j(t)\, dt, \qquad j = 1, 2, \ldots \qquad (9)$$

The functions $\theta_j(t)$ are the eigenfunctions of the covariance
function of the process and $\lambda_j$ are the corresponding eigenvalues

$$\int_0^T C(t,s)\, \theta_j(t)\, dt = \lambda_j\, \theta_j(s), \qquad j = 1, 2, \ldots \qquad (10)$$

Since the process X(t) is zero mean, the $Z_j$ are uncorrelated
random variables with mean zero and variance $\lambda_j$. Equation (8)
is called the Karhunen Loeve expansion (KLE), while equation (9)
is known as the Karhunen Loeve transform (KLT). As we consider
a realization x(t) of the process at n time points, t = 1, ..., n, a finite
approximation of equations (8), (9) and (10) is required. In particular,
equation (10) becomes

$$\sum_{t=1}^{n} C(t,s)\, \theta_j(t) = \lambda_j\, \theta_j(s), \qquad j = 1, 2, \ldots \qquad (11)$$

or in matrix form $C\theta_j = \lambda_j \theta_j$, where C is the covariance
matrix of the series. From (11) it follows that the estimators of the
eigenfunctions and the eigenvalues of the covariance function are,
respectively, the eigenvectors and the eigenroots of matrix C. In
the finite approximation the KLE and KLT are respectively

$$X(t) = \sum_{j=1}^{n} Z_j\, \theta_j(t) \qquad (12)$$

$$Z_j = \sum_{t=1}^{n} \theta_j(t)\, X(t), \qquad j = 1, 2, \ldots, n \qquad (13)$$

The KLE has the property that, if the eigenfunctions are ar- 
ranged according to non-increasing order of the eigenvalues, the 
approximation based on the first p terms minimizes the mean 
square error (Basilevsky, 1994). 
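
As a rough illustration of the finite approximation (11)-(13), the sketch below estimates a covariance matrix from a single realization, diagonalizes it and rebuilds the series from its leading terms. It assumes NumPy is available. The Toeplitz matrix of biased sample autocovariances is only one possible estimator of C (the text does not say which one is used), and all function names are hypothetical.

```python
import numpy as np

def sample_autocov_matrix(x):
    """Toeplitz matrix of biased sample autocovariances (an assumed estimator of C)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    acov = np.array([xc[: n - h] @ xc[h:] / n for h in range(n)])
    lags = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    return acov[lags]

def kl_transform(x, C):
    """Return eigenvalues (non-increasing), eigenvectors (columns) and KLT coefficients."""
    lam, theta = np.linalg.eigh(C)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, theta = lam[order], theta[:, order]
    z = theta.T @ np.asarray(x, dtype=float)  # finite KLT, as in (13)
    return lam, theta, z

def kl_reconstruct(theta, z, p):
    """Partial sum of the finite KLE (12) using the first p eigenvectors."""
    return theta[:, :p] @ z[:p]

# Toy example: a noisy 12-period sinusoid, centred to have zero mean.
rng = np.random.default_rng(0)
t = np.arange(1, 97)
x = np.sin(2 * np.pi * t / 12) + 0.3 * rng.standard_normal(t.size)
x = x - x.mean()
lam, theta, z = kl_transform(x, sample_autocov_matrix(x))
x_hat = kl_reconstruct(theta, z, 2)           # two leading eigenvectors only
```

With all n terms the reconstruction returns the series itself, and the squared reconstruction error can only decrease as more terms are added.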

2.1 Detection of Structural Components 

For a weakly stationary process, the denoising problem can be
solved by means of a linear approximation in the time domain
(Mallat, 1998). It is known that the eigenvectors corresponding
to the largest eigenvalues contain information about the temporal
correlation and, hence, about the signal L(t). On the other hand,
the eigenvectors associated with the smallest eigenvalues are related
to the noise component A(t) (Mardia et al., 1998). Several techniques
have been proposed to choose the number of eigenvectors
to be considered in the reconstruction formula (12), and most of
them are described in Mardia et al. (1994). In this paper the
estimation of the signal is based on the preliminary identification
and reconstruction of its additive components. In fact, the stationary
signal process can always be decomposed into elementary
harmonic waves, which may be aggregated into structural components,
as summarized in equation (14)

L(t) = C(t) + S(t) (14)

where C(t) is a cycle with variable amplitude and S(t) is a seasonal
component (Harvey, 1989).

In order to identify and estimate the structural components 
of a second order stationary process X(t), we propose the joint 
employment of spectral analysis and Karhunen Loeve expansion. 
The joint application of the above methodologies requires the use 
of periodogram analysis as a tool to identify the structural compo- 
nents and to test their reconstruction, obtained by the Karhunen 
Loeve technique. Given a time series x(t), in order to obtain a 
zero mean stationary series we need to remove the trend compo- 
nent, if it exists. The detection of the presence of the remaining 
structural components in the detrended series y(t) is realized by 
means of the periodogram. The peaks of the periodogram show 
the presence of cyclical components with different frequencies. 
The significance of these peaks is checked by Whittle’s and Hart- 
ley’s tests (Priestley, 1981). The reconstruction of the structural 
components of the series is obtained by KLE. In (12), with respect 
to each harmonic component, shown by the periodogram, a linear 
combination of a subset of eigenvectors of the covariance matrix 
must be considered. The eigenvectors 6j(t) with similar temporal 
pattern are strongly related to the same basic constituent and 
then may be aggregated in (12). According to Basilevsky (1994), 
these eigenvectors correspond to insignificantly different eigenval- 
ues and can be identified by the following test on the equality of 
the corresponding r eigenroots

$$\chi^2 = -n \sum_{j=1}^{r} \ln \lambda_j + n r \ln\left( \sum_{j=1}^{r} \lambda_j / r \right) \qquad (15)$$

This statistic has a Chi-square distribution with $\frac{1}{2} r(r+1) - 1$
degrees of freedom. The fit of each structural component is
reliable if the estimated spectral density function has just one peak,
at the frequency of the harmonic component that we attempt to
reconstruct. To test the goodness of the signal reconstruction we
estimate the spectral density function of the residual component,
which must be similar to the spectrum of a white noise.
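
The equal-eigenroot test (15) is simple enough to sketch directly. The function below is a minimal illustration with a hypothetical name; it returns the statistic and its degrees of freedom and leaves the comparison with a Chi-square critical value to the reader (for r = 2, i.e. 2 degrees of freedom, the 1% critical value is 9.210, the value reported in Table 3).

```python
import math

def equal_roots_statistic(eigenvalues, n):
    """Statistic (15) for the equality of r eigenroots, with its degrees of freedom.

    eigenvalues: the r positive eigenroots under test
    n: the sample size entering (15)
    """
    r = len(eigenvalues)
    if r < 2 or min(eigenvalues) <= 0:
        raise ValueError("need at least two strictly positive eigenvalues")
    log_sum = sum(math.log(lam) for lam in eigenvalues)
    mean_root = sum(eigenvalues) / r
    stat = -n * log_sum + n * r * math.log(mean_root)
    df = r * (r + 1) // 2 - 1
    return stat, df

# Identical roots give a statistic of zero; unequal roots give a positive value.
stat_equal, df_equal = equal_roots_statistic([2.0, 2.0], n=10)
stat_unequal, df_unequal = equal_roots_statistic([3.0, 1.0], n=10)
```

By the inequality between arithmetic and geometric means the statistic is never negative, so small values support the aggregation of the corresponding eigenvectors.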

2.2 Application to the Monthly Energy Consumption 
in Italy (1978-1995) 

The proposed methodology has been applied to the monthly energy
consumption in Italy from 1978 to 1995 (source: ENEL). In
order to obtain a mean stationary time series, a linear trend has
been removed (Figure 1.a). Whittle’s and Hartley’s tests on the
periodogram peaks (Tables 1 and 2) confirm the presence of harmonic
components in the series.
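
For readers who want to reproduce the peak search, the following sketch computes a periodogram at the angular Fourier frequencies $w_k = 2\pi k/n$. The normalization by $2\pi n$ is one of several conventions in use and is an assumption here, as is the function name; Whittle’s and Hartley’s tests themselves are not implemented.

```python
import cmath
import math

def periodogram(x):
    """Return a list of (w_k, I(w_k)) pairs for k = 1, ..., n // 2.

    w_k = 2*pi*k/n is an angular frequency; the series is mean-centred first.
    """
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    result = []
    for k in range(1, n // 2 + 1):
        w = 2 * math.pi * k / n
        dft = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(centred, start=1))
        result.append((w, abs(dft) ** 2 / (2 * math.pi * n)))
    return result

# Toy monthly series of length 216 (18 years) with a pure 12-month cycle.
series = [math.cos(2 * math.pi * t / 12) for t in range(1, 217)]
peak_frequency, peak_value = max(periodogram(series), key=lambda pair: pair[1])
```

For this toy series the largest ordinate falls at $2\pi/12 \approx 0.5236$, the angular frequency that corresponds to a period of 12 months.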



Table 1. Whittle’s Test on the Periodogram Peaks

Frequencies  Test Statistics  Critical value 1%    Frequencies  Test Statistics  Critical value 1%
0.254        0.298            0.083                1.047        0.263            0.087
3.142        0.386            0.084                0.058        0.335            0.087
2.094        0.287            0.084                2.618        0.299            0.088
1.571        0.308            0.085                0.087        0.118            0.089
0.029        0.231            0.086                0.116        0.161            0.090



Table 2. Hartley’s Test on the Periodogram Peaks

Test Statistics  Critical value 1%
2.6446           1.999



The significant frequencies highlight the occurrence of cyclical
and seasonal components. The eigenvalues scree plot (Figure 1.b),
the Chi-square test (15) and the similarities among the eigenvector
periodograms allow the identification of the eigenvector subsets for the
estimation of nine harmonic components (Table 3). As stressed by
the peaks of the estimated signal periodogram (Figure 1.e), the
reconstructed detrended series (Figure 1.c) preserves the
spectral content of the original one. The goodness of the
reconstruction methodology is supported by the estimated noise
periodogram (Figure 1.f), whose behavior is similar to the
spectrum of a white noise, as confirmed by Whittle’s and Hartley’s tests.
Fig. 1. Monthly energy consumption in Italy (1978-1995).



Table 3. Chi-square Test on the Eigenvectors

Frequencies  Period in months  Basic components      Eigenvectors subsets  Explained variance %  Test Statistics  Critical value 1%
0.5236       12                1st: S(t)             [illegible]           35                    0.019            9.210
3.1416       2                 2nd: S(t)             3rd                   17.9                  -                -
2.0944       3                 3rd: S(t)             4th-5th               17.4                  0.005            9.210
1.5708       4                 4th: S(t)             [illegible]           12.2                  0.006            9.210
1.0472       6                 5th: S(t)             [illegible]           5.3                   0.008            9.210
0.0291       216-108           6th-7th: [illegible]  [illegible]           4.5                   0.001            9.210
2.618        2.5               8th: S(t)             12th-13th             2.6                   0.010            9.210
0.0873       72                9th: C(t)             14th                  1.2                   -                -



3 Non Stationary Time Processes: Wavelet 
Expansion 

The series expansion (2) of the process in an orthogonal basis
can also be realized in the wavelet domain (Mallat, 1998). Since
the structure of the wavelet basis functions provides specificity
in location (via translations) and in frequency (via dilations),
this methodology allows a simultaneous analysis of the spectral
content of the process in both the frequency and the time domain. The
wavelet transform offers a localized frequency decomposition and
provides information not only on which frequency components are
present in the signal, but also on when or where they occur.
Therefore wavelets have significant advantages over basic Fourier
analysis when the process under study is non-stationary.

In this paper we focus our attention only on the Discrete Wavelet
Transform (DWT). In this case the basis of the
Hilbert space $L_2(T)$ is constructed by translating the father wavelet
$\varphi$ (scaling function) and by translating and dilating the mother
wavelet $\psi$ (analyzing function). Consequently the orthonormal
basis, for some fixed $j_0 \in \mathbb{Z}$, is given by

$$\varphi_{j_0,k}(t) = 2^{j_0/2}\, \varphi(2^{j_0} t - k) \qquad (16)$$

$$\psi_{j,k}(t) = 2^{j/2}\, \psi(2^{j} t - k) \qquad (17)$$

In the wavelet basis, equation (2) can be expressed as

$$X(t) = \sum_{k \in \mathbb{Z}} c_{j_0,k}\, \varphi_{j_0,k}(t) + \sum_{j \geq j_0} \sum_{k \in \mathbb{Z}} d_{j,k}\, \psi_{j,k}(t) \qquad (18)$$

where the scaling coefficients $c_{j_0,k}$ and the wavelet coefficients $d_{j,k}$ are
given by the following wavelet transforms

$$c_{j_0,k} = \int X(t)\, \varphi_{j_0,k}(t)\, dt \qquad (19)$$

$$d_{j,k} = \int X(t)\, \psi_{j,k}(t)\, dt \qquad (20)$$

3.1 Detection of the Noiseless Component in Wavelet 
Domain 

Thanks to the time-frequency localization property of the wavelet
representation, the identification and reconstruction of the signal
component of a non-stationary time series can be carried out in the
wavelet domain.

Let $x = \{x(t_1), x(t_2), \ldots, x(t_n)\}$ be a time series of length $n = 2^J$,
equally spaced in the time interval [0, 1]. The coefficients obtained
by the wavelet transforms (19) and (20) allow the time series to be
represented in a time-frequency domain. For a given level of
approximation $j_0$, the vector x is wavelet transformed into a new
vector w of the same length. More specifically, given $j_0 = 0$, we
can compute one scaling coefficient $c_{0,0}$ and $2^J - 1$ detail
coefficients $d_{j,k}$, $j = 0, 1, \ldots, J-1$, $k = 0, \ldots, 2^j - 1$ (so that $w = (c_{0,0}, d)$).
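
The bookkeeping of one scaling coefficient plus $2^J - 1$ detail coefficients can be checked with a small sketch. It uses the Haar wavelet, chosen here only because it fits in a few lines; it is not the Daubechies wavelet with 8 vanishing moments used in Section 4, and the function names are hypothetical.

```python
import math

def haar_dwt(x):
    """Orthonormal Haar DWT of a sequence of length 2**J.

    Returns (c00, details), where details[j] holds the 2**j coefficients
    d_{j,k}, k = 0, ..., 2**j - 1, for j = 0 (coarsest) to J - 1 (finest).
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    details.reverse()  # finest level was computed first
    return approx[0], details

def haar_idwt(c00, details):
    """Inverse of haar_dwt: rebuild the series from coarsest to finest level."""
    approx = [c00]
    for level in details:
        finer = []
        for a, d in zip(approx, level):
            finer.append((a + d) / math.sqrt(2))
            finer.append((a - d) / math.sqrt(2))
        approx = finer
    return approx

x = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # n = 2**3
c00, details = haar_dwt(x)
x_back = haar_idwt(c00, details)
```

Because the transform is orthogonal, the sum of squares of x equals the sum of squares of the coefficients, and the inverse transform recovers x.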

The transformation from the temporal to the wavelet domain
can be expressed in matrix form as w = Wx, where W is an
orthogonal matrix associated with the chosen orthonormal basis. Since
the observed data can be considered as a signal L(t) corrupted
by a white noise A(t), as in equation (1), the wavelet coefficients
$d_{j,k}$ can also be written as the sum of a signal and an erratic
component, $d_{j,k} = d^*_{j,k} + \eta_{j,k}$. In order to reconstruct the signal,
we need to remove the noise component from each coefficient. A
technique to obtain noise-free coefficients is wavelet shrinkage
(Donoho and Johnstone, 1994, 1995). Under this scheme "hard"
or "soft" thresholded coefficients $\hat{d}_{j,k} = \delta_\lambda(d_{j,k})$ can be obtained
respectively from (21) and (22)

$$\delta_\lambda(d_{j,k}) = d_{j,k}\, I(|d_{j,k}| > \lambda) \qquad (21)$$

$$\delta_\lambda(d_{j,k}) = \mathrm{sign}(d_{j,k}) \max(0, |d_{j,k}| - \lambda) \qquad (22)$$

Various approaches suggest different choices of the threshold $\lambda$
(for a review see Antoniadis, 1995). In this paper, for simplicity,
we use the universal threshold $\lambda_{un} = \hat{\sigma}_A \sqrt{2 \log n}$ (Donoho and
Johnstone, 1994), where the estimator of the noise standard deviation
$\hat{\sigma}_A$ can be the median absolute deviation (MAD) of the detail
coefficients at the highest level

$$\hat{\sigma}_A = MAD\{d_{J-1,k}\} = \mathrm{median}\{ |d_{J-1,k} - \mathrm{median}\{d_{J-1,k}\}| \} / 0.6745 \qquad (23)$$
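
The shrinkage rules (21)-(22), the MAD estimator (23) and the universal threshold translate almost line by line into code. The sketch below uses only the standard library; the function names are hypothetical and the logarithm is taken to be the natural logarithm.

```python
import math
from statistics import median

def hard_threshold(d, lam):
    """Rule (21): keep the coefficient if it exceeds lam in absolute value."""
    return d if abs(d) > lam else 0.0

def soft_threshold(d, lam):
    """Rule (22): shrink the coefficient towards zero by lam."""
    return math.copysign(max(0.0, abs(d) - lam), d)

def mad_sigma(finest_details):
    """Estimator (23): scaled median absolute deviation of the finest level."""
    centre = median(finest_details)
    return median(abs(v - centre) for v in finest_details) / 0.6745

def universal_threshold(finest_details, n):
    """Universal threshold: sigma_hat * sqrt(2 * log(n)), natural logarithm."""
    return mad_sigma(finest_details) * math.sqrt(2.0 * math.log(n))

finest = [1.0, 2.0, 3.0, 4.0, 5.0]        # made-up finest-level coefficients
lam = universal_threshold(finest, n=8)
kept_hard = [hard_threshold(d, lam) for d in finest]
kept_soft = [soft_threshold(d, lam) for d in finest]
```

Hard thresholding leaves the surviving coefficients untouched, while soft thresholding also pulls them towards zero by the threshold, which gives a smoother but more biased reconstruction.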

The inverse wavelet transform (18) applied to the thresholded
detail coefficients $\hat{d}_{j,k}$ yields the reconstruction of the signal
component

$$\hat{L}(t) = c_{0,0}\, \varphi_{0,0}(t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} \hat{d}_{j,k}\, \psi_{j,k}(t) \qquad (24)$$



The series components at different frequencies are obtained by
applying the reconstruction formula (18) for each level of
resolution

$$X_j(t) = \sum_{k=0}^{2^j - 1} d_{j,k}\, \psi_{j,k}(t) \qquad (25)$$

The analysis of each basic component can suggest where the
different frequencies are localized in time.



Fig. 2. SO2 daily mean concentration in the Leeds District (29/03/83-21/08/84):
Series, Estimated Signal and Estimated Noise.
4 Application to the Daily Concentration
of Sulphur Dioxide (SO2) in the Leeds
District

The wavelet decomposition has been applied to the daily time
series of SO2 mean concentration in the Leeds district (29/03/83-
21/08/84) (Mardia et al., 1998).

In the analysis we use a Daubechies wavelet with 8 vanishing moments
and a periodic symmetric boundary condition (Mallat, 1998). The
second order non-stationarity of the series (Figure 2.a) is
confirmed by White’s heteroskedasticity test (White, 1980).
Despite the non-stationarity in covariance, the analysis in the
wavelet domain allows the signal to be separated from the additive
noise component. Using hard thresholded coefficients, the signal
component is estimated by means of (24). Figure 2.b shows
that the reconstructed signal is able to track the raw data.
The goodness of the signal reconstruction is highlighted by the
similarity of the periodograms of the estimated signal and of the
series (Figure 3.a). Figure 3.b shows the randomness of the
estimated noise, confirmed by Whittle’s and Hartley’s tests on the
periodogram peaks.
Fig. 3. SO2 daily mean concentration in the Leeds District (29/03/83-21/08/84):
Series, Estimated Signal and Estimated Noise Periodograms.

Fig. 4. SO2 daily mean concentration in the Leeds District (29/03/83-21/08/84):
Detail Components at different resolution levels.



The signal reconstruction for each resolution level (Figure 4)
emphasizes the wavelet localization property. The lower resolution
levels are associated with basic components of low-frequency
content and large spread on the time axis. At higher resolution
levels the higher-frequency components are better localized in
time.
Abstract. The stationarity of a time series is often achieved by transforming
the observed data. When the analysis of the series is carried out automatically
by software, it is necessary to define indicators which
alert the system to the non-stationarity of the data and lead to the right
transformations. In this context, the present paper proposes an indicator which detects
the heteroskedasticity of the data; its empirical distribution has been investigated
through Monte Carlo simulations. The performance of the indicator has been
compared to a well-known homoskedasticity test usually implemented in statistical
software.



1 Introduction 

The seminal work of Box and Jenkins (1976) has given a systematic
way to analyze time series. It is based on several steps
(specification, estimation, diagnostic checking and forecasting) which
often imply complex analysis and require a deep knowledge of the
theoretical background. In order to make the method available to
people without expertise, several software packages have been developed to
automate the Box and Jenkins procedure and to simplify the analysis
of large numbers of time series. These include the TESS
System (http://esl.jrc.it/tess/), implemented within the Esprit IV
29.741 project supported by Eurostat. This software has
recently been submitted to an assessment study based on Monte Carlo

* The present paper is partially supported by MURST 2000: Modelli stocastici e
metodi di simulazione per l'analisi di dati dipendenti. It is a joint work of both
authors; however, Niglio wrote sections 1 and 3, while Pagnotta wrote sections 2 and 4.
simulations to evaluate its performance and the ability of the
System to extract the trend and the seasonal component and to generate
forecasts (Giordano et al., 2000). The analysis of the TESS System
starts by checking whether the time series under study has
desirable properties such as mean and variance stationarity, which
have great relevance in model identification and parameter
estimation respectively. These properties can be reached by carrying out
suitable preadjustments of the data, which are selected from a set of
different transformations according to the value of some indicators
evaluated by the System. In the present paper a procedure
to investigate the heteroskedasticity of the time series is proposed.
It is based on two smoothers estimated through the quantiles
obtained by dividing the data set into subseries of fixed length. An index
which alerts the analyst to the presence of heteroskedasticity is
derived. A Monte Carlo simulation has been implemented in order
to test its performance and to investigate its empirical distribution.
The proposed procedure is finally compared with the Goldfeld
and Quandt test (1965), widely implemented in statistical software
for time series analysis. The paper is organized as follows:
the theoretical background is presented in section 2; the results of
a Monte Carlo experiment and the proposed detection rule are
presented in section 3, while some concluding remarks and open
problems are in the final section.

2 The Theoretical Background 

The classical analysis of time series assumes that the series $X_t$, t =
1, ..., N, can be decomposed into four orthogonal unobservable
components: the trend $T_t$, the cycle $C_t$, the seasonal $S_t$, and the
irregular component $U_t$. These components can be combined in
different ways, providing time series with different features.

In the following we assume that the time series $X_t$ has an
additive structure with a constant cycle such that:

$$X_t = T_t + S_t + U_t, \qquad (1)$$

where the trend is linear, the seasonal component has constant
amplitude and phase and the irregular component is an ARMA(p,q)
process. These requirements provide a homoskedastic time series
model.

Our aim is to derive a synthetic measure which highlights possible
changes in the variability of the time series. The construction of
this measure is based on two steps: first we consider two resistant
smoothers, and then the smoothed values obtained from each of
them are regressed over time with a simple regression model. The
difference $\eta$ of the estimates of the two slopes is our index of interest.
Our conjecture is that when the time series is homoskedastic
$\eta \approx 0$, while in the presence of heteroskedasticity $\eta \neq 0$ and its value
is a function of the overall variability.

Given an integer $k \ll n$, for each $t = k+1, k+2, \ldots, n-k$
we consider a window:

$$W_t = \{X_{t-k}, X_{t-k+1}, \ldots, X_{t-1}, X_t, X_{t+1}, \ldots, X_{t+k}\} \qquad (2)$$

where the number of elements in $W_t$ is 2k + 1, while the number
of all windows is n - 2k.

For each $W_t$ we compute the smoothers:

$$Q_1(t) = \text{0.25 sample quantile of } W_t$$
$$Q_3(t) = \text{0.75 sample quantile of } W_t. \qquad (3)$$

Let $\hat{b}^{(1)}$ and $\hat{b}^{(3)}$ be the OLS estimates of the two slopes of the
linear regressions:

$$Q_1(k+t) = a^{(1)} + b^{(1)} t + e^{(1)}_t, \qquad t = 1, 2, \ldots, n-2k$$
$$Q_3(k+t) = a^{(3)} + b^{(3)} t + e^{(3)}_t, \qquad t = 1, 2, \ldots, n-2k. \qquad (4)$$

The slope $\hat{b}^{(i)}$ can be written as a weighted mean of the smoothed
values $Q_i(t)$ by using standard results on simple linear regression
models:

$$\hat{b}^{(i)} = \sum_{t=1}^{n-2k} d_t\, Q_i(k+t) \quad \text{and} \quad d_t = \frac{t - \frac{n-2k+1}{2}}{\sum_{t=1}^{n-2k} \left( t - \frac{n-2k+1}{2} \right)^2}, \qquad i = 1, 3 \qquad (5)$$

Our proposed measure to evaluate changes in variability is
then

$$\eta = \hat{b}^{(3)} - \hat{b}^{(1)} = \sum_{t=1}^{n-2k} d_t \left( Q_3(k+t) - Q_1(k+t) \right).$$
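
The construction in (2)-(5) can be sketched in a few lines of standard-library code. The sample quantile below uses linear interpolation between order statistics, which is only one of several common definitions (the text does not say which one is used), and the function names are hypothetical.

```python
def sample_quantile(values, p):
    """Sample quantile by linear interpolation (an assumed definition)."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def ols_weights(m):
    """Weights d_t, t = 1..m, such that the OLS slope equals sum(d_t * y_t)."""
    t_bar = (m + 1) / 2
    denom = sum((t - t_bar) ** 2 for t in range(1, m + 1))
    return [(t - t_bar) / denom for t in range(1, m + 1)]

def eta_index(x, k):
    """Difference between the slopes of the upper and lower quartile smoothers."""
    n = len(x)
    m = n - 2 * k                      # number of windows
    if m < 2:
        raise ValueError("series too short for this window half-width")
    spread = []
    for centre in range(k, n - k):     # 0-based centre of each window W_t
        window = x[centre - k : centre + k + 1]
        spread.append(sample_quantile(window, 0.75) - sample_quantile(window, 0.25))
    return sum(d * s for d, s in zip(ols_weights(m), spread))

# A pure linear trend has a constant interquartile spread, so eta is zero.
eta_trend = eta_index([0.04 + 0.008 * t for t in range(1, 201)], k=12)
# Oscillations whose amplitude grows with t give a clearly positive eta.
eta_growing = eta_index([t if t % 2 else -t for t in range(1, 201)], k=12)
```

Computing the index from the interquartile spread directly is equivalent to fitting the two regressions separately, because the OLS slope is linear in the response.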




Fig. 1. Airline Passenger time series (solid line) with the resistant smoothers $Q_1(t)$
and $Q_3(t)$ (sketched line) and the lines $r_1$ and $r_3$ (dotted line)



We can show, by using the standard geometric interpretation of slopes
and trigonometric formulae, that $\eta$ is proportional to the
tangent of the angle $\Delta\theta = \theta_3 - \theta_1$, where $\theta_i$, i = 1, 3, is the
angle between the time axis and the straight lines $r_1$ and $r_3$, whose
equations are the linear regression models in (4) respectively.
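
The step can be made explicit with the tangent subtraction formula. The following is a sketch that writes $b_1$ and $b_3$ for the two estimated slopes and identifies each slope with the tangent of the corresponding angle, which assumes that both axes are drawn on the same unit scale.

```latex
\tan(\Delta\theta) = \tan(\theta_3 - \theta_1)
  = \frac{\tan\theta_3 - \tan\theta_1}{1 + \tan\theta_1 \tan\theta_3}
  = \frac{b_3 - b_1}{1 + b_1 b_3}
  = \frac{\eta}{1 + b_1 b_3}
```

Under this reading the proportionality factor is $1/(1 + b_1 b_3)$, which depends on the two slopes, and $\eta = 0$ exactly when the two regression lines are parallel.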

Our conjecture can now be presented graphically. Figure 1
shows the well known Airline passengers time series data (solid
line) together with the smoothers (sketched line) and the regression
lines (dotted line). The regression lines diverge as time
increases, and the value of $\eta$ is 0.5306.

In Figure 2 the same time series has first been log-transformed
and then the smoothers and the regression lines have
been estimated. In this case the regression lines are almost parallel,
giving a value of $\eta$ = 0.00.




Fig. 2. Log of the Airline Passenger time series (solid line) with the resistant
smoothers $Q_1(t)$ and $Q_3(t)$ (sketched line) and the lines $r_1$ and $r_3$ (dotted line)



The following proposition ensures that, under the hypotheses
stated above, the expected value of $\eta$ tends to zero.

Proposition 1. Let $X_t$, t = 1, 2, ..., n, be a homoskedastic time
series following the additive model (1) with orthogonal components;
let

$$\eta = \sum_{t=1}^{n-2k} d_t \left( Q_3(k+t) - Q_1(k+t) \right),$$

where $Q_i$, i = 1, 3, is introduced in (3) and $d_t$ is defined in (5).

It follows that

$$\lim_{n \to \infty} E(\eta) = 0$$
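Proof.

$$E(\eta) = E\left( \sum_{t=1}^{n-2k} d_t \left( Q_3(k+t) - Q_1(k+t) \right) \right) = \sum_{t=1}^{n-2k} d_t\, E\left( Q_3(k+t) - Q_1(k+t) \right) = \ldots$$

Let $\tau_{i,t}$ be an index such that $Q_i(k+t) = X_{\tau_{i,t}}$, i = 1, 3, then

$$\ldots = \sum_{t=1}^{n-2k} d_t\, E\left( X_{\tau_{3,t}} - X_{\tau_{1,t}} \right)$$

$$= \sum_{t=1}^{n-2k} d_t\, E\left( T_{\tau_{3,t}} + S_{\tau_{3,t}} + U_{\tau_{3,t}} - (T_{\tau_{1,t}} + S_{\tau_{1,t}} + U_{\tau_{1,t}}) \right)$$

$$= \sum_{t=1}^{n-2k} d_t\, E(T_{\tau_{3,t}} - T_{\tau_{1,t}}) + \sum_{t=1}^{n-2k} d_t\, E(S_{\tau_{3,t}} - S_{\tau_{1,t}}) + \sum_{t=1}^{n-2k} d_t\, E(U_{\tau_{3,t}} - U_{\tau_{1,t}})$$

$$= \sum_{t=1}^{n-2k} d_t\, (T_{\tau_{3,t}} - T_{\tau_{1,t}}) + \sum_{t=1}^{n-2k} d_t\, (S_{\tau_{3,t}} - S_{\tau_{1,t}}) + \sum_{t=1}^{n-2k} d_t\, E(U_{\tau_{3,t}} - U_{\tau_{1,t}}).$$

$T_t$ is a continuous function of t; then, by using the Lagrange
(mean value) theorem,

$$T_{\tau_{3,t}} - T_{\tau_{1,t}} = c_t\, (\tau_{3,t} - \tau_{1,t}),$$

for each t. It follows:

$$|T_{\tau_{3,t}} - T_{\tau_{1,t}}| \leq c_t\, |\tau_{3,t} - \tau_{1,t}| \leq 2kC$$

with $C = \max c_t$ and $2k = \max |\tau_{3,t} - \tau_{1,t}|$. Given these inequalities
we have:

$$\left| \sum_{t=1}^{n-2k} d_t\, (T_{\tau_{3,t}} - T_{\tau_{1,t}}) \right| \leq \sum_{t=1}^{n-2k} |d_t|\, |T_{\tau_{3,t}} - T_{\tau_{1,t}}| \leq 2kC \sum_{t=1}^{n-2k} |d_t|$$

where the last summation goes to zero as n goes to infinity; in
fact, with $\bar{t} = (n-2k+1)/2$:

$$\sum_{t=1}^{n-2k} |d_t| = \frac{\sum_{t=1}^{n-2k} |t - \bar{t}|}{\sum_{t=1}^{n-2k} |t - \bar{t}|^2} = \frac{O(n^2)}{(n-2k)\left[ (n-2k)^2 - 1 \right] / 12} \sim O\left( \frac{1}{n} \right) \to 0 \quad \text{as } n \to \infty.$$

With similar steps the summation involving the seasonal component
$S_t$ also vanishes as n goes to infinity.

On the other hand, the summation involving the stochastic
component $U_t$ is zero because

$$\sum_{t=1}^{n-2k} d_t\, E(U_{\tau_{3,t}} - U_{\tau_{1,t}}) = \sum_{t=1}^{n-2k} d_t \left[ E(U_{\tau_{3,t}}) - E(U_{\tau_{1,t}}) \right] = \sum_{t=1}^{n-2k} d_t\, (\zeta_{0.75} - \zeta_{0.25}) = (\zeta_{0.75} - \zeta_{0.25}) \sum_{t=1}^{n-2k} d_t = 0$$

where $\zeta_{0.25}$ and $\zeta_{0.75}$ are the population quantiles (Rohatgi, 1984, p.616)
of the irregular component $U_t$ and $\sum_{t=1}^{n-2k} d_t = 0$.

□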
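
The rate at which the sum of the absolute OLS weights vanishes can be checked numerically. In the sketch below m stands for the number of windows, n - 2k; for even m a direct calculation gives the closed form 3m / (m^2 - 1), which is stated here as a check on the code rather than as part of the original proof.

```python
def ols_weights(m):
    """Weights d_t = (t - t_bar) / sum((t - t_bar)**2), t = 1..m."""
    t_bar = (m + 1) / 2
    denom = sum((t - t_bar) ** 2 for t in range(1, m + 1))
    return [(t - t_bar) / denom for t in range(1, m + 1)]

def abs_weight_sum(m):
    """Sum of |d_t|, the quantity bounded in the proof."""
    return sum(abs(d) for d in ols_weights(m))

# For even m the sum equals 3m / (m**2 - 1), i.e. it decays like 3 / m.
for m in (10, 100, 1000):
    print(m, abs_weight_sum(m), 3 * m / (m * m - 1))
```

The weights themselves always sum to zero, which is the fact used for the irregular component at the end of the proof.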

The sampling distribution of $\eta$ is investigated through a Monte
Carlo simulation discussed in the next section.

3 Empirical Results and Test Procedure 

The Monte Carlo study has been carried out considering model
(1), where the linear trend is generated as:

$$T_t = 0.04 + 0.008\,t \qquad (6)$$

and the seasonal component is a Fourier series:

$$S_t = A_0 + \sum_{i=0}^{s/2} \left[ A_i \cos(\omega_i t) + B_i \sin(\omega_i t) \right] \qquad (7)$$

with $A_i = R\cos(0.05)$, $B_i = -R\sin(0.05)$ and R = 2. The irregular
component is an ARMA(p,q) process:

$$\phi_p(B)\, U_t = \theta_q(B)\, \varepsilon_t \qquad (8)$$

with Gaussian white noise errors $\varepsilon_t$ and coefficients of $\phi$ and $\theta$
shown in Table 1. The autoregressive and moving average orders
of the ARMA(p,q) components are p = 1 and $q \in \{1, 2, 3\}$
respectively. The combination of the different orders gives rise to 3
additive homoskedastic models, labeled in the following as O11,
O12 and O13. For each of them 1000 time series are generated,
with seasonal period s = 12 and length n = 200.



Table 1. Coefficients of the simulated models

Coefficient  $\phi_1$  $\theta_1$  $\theta_2$  $\theta_3$
Value        0.57      0.36        0.12        0.11
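
A minimal sketch of how the trend (6) and the irregular component (8) might be simulated is given below. Several choices are assumptions not fixed by the text: the sign convention of the moving average part, a unit noise standard deviation, and a burn-in period to remove the effect of the zero starting values. The seasonal term (7) is left out because the frequencies $\omega_i$ are not specified here, and the function names are hypothetical.

```python
import random

def simulate_arma(n, phi, thetas, sigma=1.0, burn_in=200, rng=None):
    """Simulate U_t = phi*U_{t-1} + e_t + sum_j thetas[j-1]*e_{t-j}.

    The plus sign on the MA terms is an assumed convention.
    """
    if rng is None:
        rng = random.Random()
    q = len(thetas)
    past_shocks = [0.0] * q            # most recent shock first
    u_prev = 0.0
    series = []
    for step in range(n + burn_in):
        shock = rng.gauss(0.0, sigma)
        u = phi * u_prev + shock + sum(th * e for th, e in zip(thetas, past_shocks))
        if q:
            past_shocks = [shock] + past_shocks[:-1]
        u_prev = u
        if step >= burn_in:
            series.append(u)
    return series

def linear_trend(n):
    """Trend (6): T_t = 0.04 + 0.008 t for t = 1, ..., n."""
    return [0.04 + 0.008 * t for t in range(1, n + 1)]

# Trend plus ARMA(1,1) irregular with the Table 1 coefficients, n = 200.
irregular = simulate_arma(200, phi=0.57, thetas=[0.36], rng=random.Random(0))
series = [tr + u for tr, u in zip(linear_trend(200), irregular)]
```

Passing a seeded random.Random instance makes a run reproducible, which is convenient when the same simulated series must be fed to several competing tests.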



For all simulated series the $\eta$ index has been computed,
and the descriptive statistics of its empirical distribution are shown
in Table 2. The mean of the distributions is always zero, in agreement
with the theoretical result of Proposition 1. The variance also appears
constant with respect to q, while there are some differences in the shape
indexes.



Table 2. Descriptive statistics of the sampling distribution of $\eta$

Model  Mean   Variance  Skewness  Kurtosis
O11    0.000  9.57 E-6  0.0431    3.1339
O12    0.000  9.26 E-6  0.0312    0.0607
O13    0.000  9.99 E-6  -0.1069   2.6734



The results in Table 2 allow the three distributions to be pooled
into one. The sample quantiles of the pooled distribution
are used to define a two-sided critical region of size $\alpha = 0.05$,
which can be used to test the homoskedasticity of the time series.
The empirical rule rejects homoskedasticity when:

$$|\eta| > \eta_0 \qquad (9)$$

where $\eta_0 = 0.0061$. It is evaluated as the mean value of
$|F^{-1}(0.025)|$ and $|F^{-1}(0.975)|$, $F^{-1}$ being the inverse of the
pooled sampling distribution function.
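
The way the critical value is obtained from the pooled simulated values can be sketched as follows. The empirical quantile uses linear interpolation between order statistics, which is an assumption (the text does not specify the definition), and the function names are hypothetical.

```python
def empirical_quantile(sorted_values, p):
    """Empirical quantile by linear interpolation (an assumed definition)."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def critical_value(simulated_etas, alpha=0.05):
    """Mean of |F^{-1}(alpha/2)| and |F^{-1}(1 - alpha/2)| of the pooled values."""
    s = sorted(simulated_etas)
    lower = empirical_quantile(s, alpha / 2)
    upper = empirical_quantile(s, 1 - alpha / 2)
    return (abs(lower) + abs(upper)) / 2

def reject_homoskedasticity(eta, eta0):
    """Rule (9): reject when the index exceeds the critical value in absolute value."""
    return abs(eta) > eta0

# Toy pooled sample on a regular grid from -0.1 to 0.1 (not simulated values).
toy_etas = [i / 1000 for i in range(-100, 101)]
toy_eta0 = critical_value(toy_etas)
```

Averaging the two absolute quantiles yields a single symmetric cut-off, which is reasonable only when the pooled distribution is close to symmetric around zero, as Table 2 suggests.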

In order to assess the performance of (9), heteroskedastic time
series have been generated through two more models. The first
is a heteroskedastic additive model (E), obtained by combining the
unobserved components as in the homoskedastic case but with the
amplitude of the seasonal component depending on t as:

$$R_t = 0.03 + 0.01\,t$$

In this case, the size of the seasonal oscillations increases with
the level of the series, as is frequently observed in economic
time series.

Finally, the multiplicative model (M):

$$X_t = T_t \times S_t \times U_t \qquad (10)$$

is given by the product of the three components in (6), (7), and
(8). This model also generates heteroskedastic time series.

The sampling distributions of $\eta$ obtained through the additive
models, both homoskedastic and heteroskedastic, are shown in Figure
3. The parallel box plots highlight the symmetry and the equal
variability of the empirical distributions of $\eta$ when additive models
are examined, even though the center of the distributions can
differ between homoskedastic and heteroskedastic time series.

The distribution of $\eta$ in the multiplicative case cannot be
graphically compared with the previous distributions because of
scale inflation. The corresponding plots are in Figure 4, where the
distribution of model M11 is remarkably different from those of models
M12 and M13 both in terms of position and variability.

The performance of the empirical rule (9) has been assessed
by comparing its rejection percentage with
that of the homoskedasticity test of Goldfeld and Quandt (1965).
The numerical results are in Table 3.

Fig. 4. Parallel box plots of the distribution of $\eta$ for the multiplicative models
(M11, M12, M13)
In the homoskedastic case the proposed rule has an empirical
level closer to the nominal one than the G-Q test. When heteroskedastic
time series are considered, the results of the two procedures are almost the
same. For this experiment the value of k in (2) has been fixed at
12. Further investigation with k = 24 does not change the results.



Table 3. Proportion of the simulated time series satisfying rule (9) and rejected by the
Goldfeld-Quandt homoskedasticity test, with $\alpha$ = 0.05

          O11    O12    O13    E11    E12    E13    M11    M12    M13
Rule (9)  0.049  0.050  0.045  0.960  0.966  0.959  1      0.995  0.990
G-Q test  0.058  0.079  0.066  0.972  0.986  0.984  0.996  0.999  0.995



4 Concluding Remarks 

In this paper we have proposed an index $\eta$ for detecting the presence
of heteroskedasticity in a time series. This index inherits the property
of resistance from the two smoothers needed for its evaluation,
which are based on the first and third sample quartiles. We explored
the theoretical behavior of the index in the case of a homoskedastic
time series and we found that the index $\eta$ has an expected
value that converges to zero in this case. The theoretical convergence to
this value improves as the length of the time series increases. The
empirical experiments confirm this result.

Given these results, and taking into account the knowledge
about the sampling distribution of $\eta$ provided by the Monte Carlo
experiment, we have also proposed a simple significance test where
the null hypothesis states that the time series is homoskedastic.
This test has been compared to that of Goldfeld and Quandt.
We found that our proposal has an empirical level of significance
closer to the nominal one than the Goldfeld and
Quandt test.

Further investigations, not shown in the paper, have been performed
in order to evaluate the robustness of the index $\eta$ when the
time series are contaminated with additive outliers. The encouraging
results obtained represent the starting point for a
promising direction of future research.
Another direction is the extension of the results shown
in sections 2 and 3 to the case of more complex trend and seasonal
structures.



Acknowledgement: We thank the anonymous referee for his comments and 
suggestions allowing us to improve this paper. 



References 

BOX, G.E.P. and COX, D.R. (1964): An analysis of transformations, Journal of 
the Royal Statistical Society (B) , 26, 211-243. 

BOX, G.E.P. and JENKINS, G.M. (1976): Time series analysis, forecasting and 
control , San Francisco: Holden Day. 

GIORDANO F., NIGLIO, M. and STORTI, G. (2000): A simulation study for 
the evaluation of the seasonal adjustment and forecasting performances of the 
TESS system, Statistica Applicata , 12, 341-360. 

GOLDFELD, S.M. and QUANDT, R.E. (1965): Some tests for homoscedasticity, 
Journal of the American Statistical Association, 60, 539-547. 

JARQUE, C.M. and BERA, A.K. (1987): A test for normality of observations and 
regression residuals, International Statistical Review, 55, 163-172. 

ROHATGI, V.K. (1984): Statistical inference, John Wiley & Sons.




Part III 



Computer Intensive Techniques and 
Algorithms 




Smoothing Score Algorithm for 
Generalized Additive Models 



Claudio Conversano 

Dipartimento di Economia e Territorio, Facoltà di Economia, Università di
Cassino, Via Mazzaroppi, I-03043, Cassino (FR), Italy



Abstract. In the framework of Generalized Additive Models (GAM) an automatic
data-driven procedure is introduced for assigning an appropriate smoother to each
covariate and for defining the order in which the covariates enter the model.
The resulting Smoothing Score algorithm aims to improve model identifiability. It
uses the bagging procedure to select the smoother to be assigned to each
covariate, together with a new scoring measure that ranks the candidate smoothers
with respect to their bagged predictive accuracy. The adequacy of this scoring
measure is evaluated on artificial data. A comparison between the Smoothing Score
algorithm and the standard GAM is made using real data concerning a classification
task.



1 The Framework 

Generalized Additive Models (GAM) represent a powerful tool
for dealing with nonlinear data structures in the framework of
regression analysis. According to Hastie and Tibshirani (1990),
the GAM model is defined as:

$E[Y \mid X] = G\{\alpha + \sum_{j=1}^{d} f_k(X_j)\}$ (1)

In eq. 1 the conditional expectation of the response variable Y
is modelled with an additive combination of d predictors, whose
influence on the response is evaluated according to the associated
smoother, corresponding to one of the possible smoothing
functions $f_k(\cdot)$, $(k = 1, \ldots, K)$, in a set of K alternative
smoothers. These functions, generally known as scatterplot
smoothers, range from non-parametric tools - such as locally
weighted polynomials (known as lowess), smoothing splines, kernel
and near-neighbor estimators - to linear parametric regression
estimators such as simple and polynomial regression. The estimation
procedure is based on the backfitting algorithm, which
cycles through the individual terms $f_k(X_j)$ in the additive model
and updates them by smoothing suitably defined partial residuals.
The link function G can usually be specified as the identity link,
used for linear and additive models for Gaussian response data,
or as a function suited to a response distribution belonging to the
exponential family. In the latter case, a local scoring algorithm
using G produces in each iteration a new working response and a
system of weights, which are handed to a weighted backfitting
algorithm.
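The backfitting cycle described above can be sketched as follows for the identity link. This is only an illustrative sketch, not the S-Plus implementation: the polynomial smoother, the function names `poly_smoother` and `backfit`, and the stopping rule are assumptions made for the example.

```python
import numpy as np

def poly_smoother(degree):
    """Hypothetical smoother: least-squares polynomial of the given degree."""
    def smooth(x, r):
        coefs = np.polyfit(x, r, degree)
        return np.polyval(coefs, x)
    return smooth

def backfit(X, y, smoothers, n_iter=50, tol=1e-8):
    """Backfitting for an additive model with identity link (a sketch).

    X: (n, d) predictors; y: (n,) response;
    smoothers: one callable (x, partial_residuals) -> fitted values per column.
    Returns the intercept and the (n, d) matrix of fitted additive terms.
    """
    n, d = X.shape
    alpha = y.mean()
    f = np.zeros((n, d))
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(d):
            # partial residuals: remove the intercept and all terms except the j-th
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smoothers[j](X[:, j], partial)
            f[:, j] -= f[:, j].mean()  # centre each term so alpha stays identified
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f
```

Any scatterplot smoother with the same call signature (for instance a lowess or spline fit) could replace `poly_smoother` in this sketch.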

2 GAMs Model Identifiability 

Although GAMs are very popular among statisticians,
they are weak with respect to model identifiability.
In fact, by changing the order in which the predictors
enter the model, as well as the type of their associated smoother
(or some smoothing parameter), the result of the final estimation
could change considerably. Hastie (1992) introduced a
stepwise selection procedure for GAM based on the Akaike Information
Criterion (AIC). This procedure, implemented in the
S-Plus routine step.gam, allows the user to step through arbitrary
models in a prespecified path, that is, once the user has explicitly
defined an ordered regimen of candidate smoothers for each predictor
to be included in the model. From a practical viewpoint,
it is impossible to specify a priori and arbitrarily the whole set of
possible smoothers (and the corresponding degrees of smoothness)
for each predictor. Some empirical trials showed that this
procedure is very sensitive to small perturbations in the data and
that it stalls when dealing with many predictors.

In the following, in order to remedy these drawbacks,
we introduce an automatic procedure for the simultaneous
selection of smoothers and predictors in GAM. Following the approach
recently introduced by Conversano (2001) for the Generalized
Additive Multi-Mixture Models (GAM-MM) of Conversano
et al. (2002), the idea is to use both the bagging procedure and a
smoother scoring criterion as tools to make a simultaneous and
automatic selection of the predictors (with their associated smoothing
functions) in the first iteration of the backfitting algorithm, in
order to improve the model identification. The paper is organized
as follows: section 3 provides some definitions and an example
of bagging-based scoring functions for smoother and variable
selection; in section 4 the estimation procedure, based on the
definition of an order of entry for the predictors in the model,
is introduced; our methodology is benchmarked in section 5
by analyzing a real data set, and some concluding
remarks are given in section 6.

3 Bagging and Scoring for Smoothers 
Selection 

3.1 Notation 

We define a set of K alternative smoothers $f_k(\cdot)$, $(k = 1, \ldots, K)$,
presenting different degrees of smoothness. Our procedure selects
the most appropriate $f_k(\cdot)$ for each candidate predictor $X_j$,
$(j = 1, \ldots, d)$, to be included in the model. For this purpose, for
the j-th predictor we define:

• $\phi[f_k(X_j, b)]$: a function ranging in [0,1] expressing the adequacy
of each candidate smoother for the current predictor. Without loss of
generality, we assume that $\phi(\cdot)$ is a non-decreasing function that
depends on the data (i.e. the j-th predictor) as well as on the goodness
of fit provided by each smoother $f_k(\cdot)$ to the data.

• $\nu(f_k)$: a function ranging in [0,1] expressing the parsimony
of the k-th smoother, which depends on its number of parameters.

• $\psi\{\phi[f_k(X_j, b)], \nu(f_k)\}$: a normalized scoring function depending
upon a combination of $\phi(\cdot)$ and $\nu(\cdot)$.

To select the most appropriate smoother for each predictor
$X_j$, we draw B bootstrap samples from the data. For the
b-th bootstrap sample $(b = 1, \ldots, B)$, we search for the smoother
presenting the highest score, i.e. the highest value of the scoring
function $\psi(\cdot)$. For this purpose, we express the result arising from
the b-th sample as follows:

$p_{kbj} = \psi\{\phi[f_k(X_j, b)], \nu(f_k)\}$ (2)

In eq. 2, $p_{kbj}$ is a normalized measure pointing out the smoother
assigned to the j-th predictor, such that $\sum_{k=1}^{K} p_{kbj} = 1$. For each
bootstrap replication, we compute the $p_{kbj}$'s and obtain the most
suitable smoother for each predictor $X_j$ over all the B
bootstrap samples according to a consensus criterion. In practice,
the choice of the most appropriate smoother (with the associated
degree of smoothness) for each predictor is a synthesis of the scoring
functions evaluated in each bootstrap sample.
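The resampling step can be sketched as follows. The sketch assumes, purely for illustration, that the candidate smoothers are polynomials of increasing degree and that accuracy is measured by the Residual Sum of Squares; the function name `bootstrap_rss` and its arguments are hypothetical.

```python
import numpy as np

def bootstrap_rss(x, y, degrees, B=100, seed=0):
    """RSS of each candidate smoother on each of B bootstrap samples.

    degrees: polynomial degrees standing in for the K candidate smoothers,
             ordered from the most to the least parsimonious (an assumption).
    Returns an array of shape (B, K).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    rss = np.empty((B, len(degrees)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # draw n cases with replacement
        xb, yb = x[idx], y[idx]
        for k, deg in enumerate(degrees):
            coefs = np.polyfit(xb, yb, deg)
            resid = yb - np.polyval(coefs, xb)
            rss[b, k] = np.sum(resid ** 2)
    return rss
```

Each row of the returned matrix holds the accuracy measures of the K candidates in one bootstrap sample, which is the input a scoring function needs.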



3.2 The Smoothing Score Function 



With respect to eq. 2, possible characterizations of the scoring
function could be Mallows' $C_p$ statistic or the Akaike AIC
statistic. With these measures, however, it is difficult to
summarize the results coming from the different bootstrap
replications.

For this reason, we introduce an empirical measure that takes
into account, for each combination of candidate smoother and
smoothing parameter, the gain in fit it provides with respect
to the most parsimonious alternative, and downgrades this gain
with respect to the complexity induced by its number of parameters.
This function, which we named Smoothing Score, is defined
as:



$p_{kbj} = \left(1 - \frac{acc_{kbj}}{acc_{1bj}}\right)\left(1 - \frac{\tau_k}{\tau_K}\right)$ (3)



where k = 1, . . . , K denotes the smoothing functions, to be or- 
dered from the most to the least parsimonious, associated with 
the j-th predictor in the b-th bootstrap sample. 

In coherence with eq. 2 we set $\phi[f_k(X_j, b)] = 1 - acc_{kbj}/acc_{1bj}$, so
that $acc_{kbj}$ and $acc_{1bj}$ are measures of accuracy (for example
the Residual Sum of Squares) provided by the k-th smoother and
the most parsimonious smoother respectively. Moreover, we set
$\nu(f_k) = 1 - \tau_k/\tau_K$, so that $\tau_k$ and $\tau_K$ are the numbers of
parameters of the k-th smoother and of the least parsimonious smoother
respectively.

In the b-th sample, for each predictor $X_j$ the most appropriate
smoother is selected according to the highest value of eq. 3.
This choice can be considered optimal in the sense that it
accounts for the trade-off between the fit each smoother provides
and the complexity induced by the number of associated
parameters.

Therefore, in the b-th sample we use an indicator variable $\delta_{kbj}$ such
that:

$\delta_{kbj} = 1$ if $k = \arg\max_k (p_{kbj})$, and $\delta_{kbj} = 0$ otherwise (4)

In this way, for each predictor $X_j$ the bagged smoother selection is
obtained according to the majority vote criterion, here specified
as:

$f_j^* = f_{k^*}$, with $k^* = \arg\max_k \sum_{b=1}^{B} \delta_{kbj}$ (5)

so that $f_j^*$ denotes the final most voted smoother for the j-th predictor.
The proportion of times this smoother has been selected
indicates the degree of reliability of the relation between the
predictor and the response variable.
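A minimal sketch of the Smoothing Score and of the majority vote is given below. It assumes that the accuracy measure is an error measure such as the RSS (so that smaller is better), that the candidates are ordered from the most to the least parsimonious, and that the scores are rescaled to sum to one as required of $p_{kbj}$; the function names are hypothetical.

```python
import numpy as np

def smoothing_scores(acc, tau):
    """Smoothing Scores of K candidate smoothers in one bootstrap sample.

    acc[k]: error measure (e.g. RSS) of smoother k; acc[0] is the most
            parsimonious candidate.
    tau[k]: number of parameters of smoother k; tau[-1] is the largest.
    """
    acc = np.asarray(acc, dtype=float)
    tau = np.asarray(tau, dtype=float)
    phi = 1.0 - acc / acc[0]   # gain in fit over the most parsimonious smoother
    nu = 1.0 - tau / tau[-1]   # parsimony term, zero for the most complex smoother
    raw = phi * nu
    total = raw.sum()
    # assumption: rescale so that the scores sum to one
    return raw / total if total > 0 else raw

def majority_vote(score_matrix):
    """score_matrix[b, k]: score of smoother k in bootstrap sample b.

    Returns the index of the most voted smoother and its share of votes.
    """
    winners = np.argmax(score_matrix, axis=1)
    counts = np.bincount(winners, minlength=score_matrix.shape[1])
    k_star = int(np.argmax(counts))
    return k_star, counts[k_star] / score_matrix.shape[0]
```

The share of votes returned by `majority_vote` corresponds to the proportion of times the selected smoother wins across the bootstrap samples.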

3.3 Computational Issues 

We have implemented the proposed procedure for smoother and
variable selection in GAM in an S-Plus routine, which we named
bag.smooth.score. Here, a complete set of candidate smoothers
is built in a priori, and the user provides the set of smoothing
parameters in the span argument. The routine has the following
syntax:

bag.smooth.score(data, span, index, boot)

It requires as input arguments the name of the data frame (data),
the vector of possible smoothing parameters (span), the index
evaluating the fit produced by each scatterplot smoother (index) and the
number of bootstrap replications (boot) for producing a bagging
evaluation of this fit. Once a sample has been drawn from
the original data set, the bag.smooth.score function computes
the fit provided by each scatterplot smoother with respect to
the user-specified vector of possible smoothing parameters, and for
each predictor it chooses the smoother according to a suitable
measure. Once the resampling scheme is completed, it returns
for each predictor the most-voted smoother $f_j^*$ according to eq. 5.

3.4 An Example on Simulated Data 

In this section, we present the results of a simulation study
showing how the Smoothing Score function works in practice.
Starting from a vector X of n = 100 independent and uniformly
distributed observations, we generated two different data
structures, termed sim1 and sim2, according to the following
equations:

sim1: $y_i = 5 + 1.6 x_{i1} + \varepsilon_i$, $\varepsilon \sim N(0, 0.25)$

sim2: $y_i = 5 - 4 \sin(\pi x_{i1}) + \log(x_{i1}) + \varepsilon_i$, $\varepsilon \sim N(0, 2.5)$

with $i = 1, \ldots, n$. We obtained two different structures, of
linear type (first case) and of nonlinear type (second case).
Scatterplots of the two data frames are shown in fig. 1. For these

Fig. 1. Scatterplots of the two simulated data structures. Panels: First Data Structure (sim1); Second Data Structure (sim2).




structures, we evaluated the outcome of eq. 2 (with respect to
B = 100 bootstrap replications) and the behavior of the Smoothing
Score function $p(\cdot)$ in a generic bootstrap sample (for example,
when b = 1). The results are summarized in tab. 1. The
different smoothers are sorted with respect to their number of
parameters. The degree of the interpolating polynomial for the
spline estimators and the magnitude of the smoothing parameter
for the lowess estimators with one or two degrees of freedom
are reported in brackets. The second column shows the number
of parameters $\tau_k$ of each smoother, whereas the third reports
the value of the function $\nu(f_k)$. For each bootstrap
sample, our bag.smooth.score routine cycled through 15 alternative
smoothers and chose the best one after sorting
the different fits in ascending order with respect to their level
of parsimony ($\tau_k$). The results show that our approach favours
the linear estimator in the first case, by imposing a strong
penalization on the less parsimonious smoothers, whereas in the
second case the lowess estimator with 2 degrees of freedom and
a bandwidth equal to 2/3 is preferred, although the less
parsimonious smoothers provide better fits. These
results are coherent with the hypotheses formulated when generating
the two data structures.

Table 1. The smoother selection for a generic bootstrap sample





                                        sim1                                        sim2
Smoother          $\tau_k$  $\nu(f_k)$  $\phi[f_k(X_j,b)]$  $p_{1,k}$  $p_{100,k}$  $\phi[f_k(X_j,b)]$  $p_{1,k}$  $p_{100,k}$
intercept         1         0.933       0.000               0.000      0.000        0.00                0.000      0.00
linear            2         0.867       0.991               0.101      1.00         0.116               0.020      0.00
lowess(1, 0.66)   3         0.800       0.991               0.094      0.00         0.529               0.085      0.00
spline(3)         4         0.733       0.991               0.086      0.00         0.553               0.082      0.00
lowess(1, 0.5)    4         0.733       0.991               0.086      0.00         0.630               0.093      0.00
spline(4)         5         0.667       0.991               0.078      0.00         0.651               0.087      0.00
lowess(2, 0.75)   5         0.667       0.991               0.078      0.00         0.649               0.087      0.02
lowess(2, 0.66)   5         0.667       0.991               0.078      0.00         0.691               0.093      0.62
spline(5)         6         0.600       0.991               0.070      0.00         0.698               0.084      0.00
lowess(1, 0.33)   6         0.600       0.991               0.070      0.00         0.701               0.085      0.00
lowess(2, 0.5)    7         0.533       0.991               0.062      0.00         0.715               0.077      0.00
lowess(1, 0.25)   8         0.467       0.991               0.055      0.00         0.715               0.067      0.00
lowess(2, 0.33)   10        0.333       0.991               0.040      0.00         0.728               0.049      0.33
lowess(2, 0.25)   14        0.067       0.991               0.008      0.00         0.742               0.010      0.01
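The two simulated structures of section 3.4 can be generated along the following lines. Several details are assumptions, since the text does not state them: the uniform distribution is taken on the unit interval (the logarithm requires positive values), the second argument of N(., .) is read as a variance, and the function name and seed are arbitrary.

```python
import numpy as np

def simulate(n=100, seed=0):
    """Generate one predictor and the sim1 (linear) and sim2 (nonlinear) responses."""
    rng = np.random.default_rng(seed)
    # assumption: x is uniform on (0, 1]; 1 - U[0, 1) avoids log(0)
    x = 1.0 - rng.random(n)
    # assumption: 0.25 and 2.5 are variances, so the scale is their square root
    e1 = rng.normal(0.0, np.sqrt(0.25), size=n)
    e2 = rng.normal(0.0, np.sqrt(2.5), size=n)
    y1 = 5.0 + 1.6 * x + e1
    y2 = 5.0 - 4.0 * np.sin(np.pi * x) + np.log(x) + e2
    return x, y1, y2
```

With a different reading of the error variance or of the support of X, the relative advantage of the flexible smoothers in the second structure would change accordingly.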



4 A Modified Backfitting Algorithm for 
Correct Model Identifiability 

Once we have found the most suitable smoother for each predictor,
we are able to define a variable selection procedure that
establishes the order in which the predictors enter the model.
The different steps of the modified backfitting procedure, which we
named Smoothing Score Algorithm, are summarized in table 2.

In the first step, the algorithm provides the sequence of ordered
predictors $\{X_{(1)}, X_{(2)}, \ldots, X_{(s)}, \ldots, X_{(d)}\}$ to be possibly included
in the model, whereas in the third step it produces the partial
residuals in each sub-iteration (r), and evaluates the improvement
in the fit provided by the last predictor entered in the model
according to the Anova-like F-test widely used in GAM modelling
(see Hastie (1992, p. 254) and Schimek (2000, p. 305)). Once
a predictor is included in the model, the partial residuals are computed
for the sub-iteration r + 1. The bagging procedure is applied
to these partial residuals in order to select different smoothers for
each candidate predictor. Repeating step 3 recursively, the algorithm
stops the variable selection procedure when the inclusion of
one among the (d - s + 1) candidate predictors is not significant,
or when all the d predictors are included in the model.
Then, the standard backfitting procedure is applied until convergence
for the model frame defined by the selected predictors.



Table 2. The Smoothing Score Algorithm 



1. Ordering. Covariates are sorted in decreasing order, according to the bagged
   smoother selections $f_j^*$ defined in eq. 5.

2. Initialize. $f_s^{*(r)}(X_{(s)}) \leftarrow 0$, $\forall s = 1, \ldots, d$

3. Update. $r \leftarrow 1$

   For the s-th predictor $(s = 1, \ldots, d)$:

   Fit the function $f_s^{*(r)}(x_{(s)})$ to the partial residuals

   $e^{(r-1)} \leftarrow [y - f_s^{*(r-1)}(x_{(s)})]$.

   Test if the improvement in the goodness of fit of the overall model induced
   by the entrance of the s-th predictor is significant.

   Define a new ordering entrance by applying step 1 to the partial residuals:

   $e^{(r)} \leftarrow [y - f_s^{*(r)}(x_{(s)})]$

   $r \leftarrow r + 1$

4. Verify. Cycle step 3 and stop the variable selection procedure when none of
   the $(d - s + 1)$ remaining predictors significantly improves the overall
   goodness of fit.

5. Backfitting. Apply standard backfitting to the residuals $[Y - f_s(X_s)]$
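The significance check in step 3 can be illustrated with the usual approximate F statistic for two nested fits. This is a sketch under the assumption that the Anova-like test compares residual sums of squares and (approximate) numbers of parameters; the function names and the externally supplied critical value are hypothetical.

```python
def anova_f(rss_small, rss_large, df_small, df_large, n):
    """Approximate F statistic comparing two nested fits.

    rss_small, df_small: residual sum of squares and number of parameters
                         of the model without the candidate predictor.
    rss_large, df_large: the same quantities after adding the predictor.
    n: number of observations.
    """
    numerator = (rss_small - rss_large) / (df_large - df_small)
    denominator = rss_large / (n - df_large)
    return numerator / denominator

def keep_predictor(f_stat, f_critical):
    """Stopping rule of the selection loop: keep the candidate only if the
    improvement is significant at the chosen level (critical value supplied
    by the user, e.g. from F tables)."""
    return f_stat > f_critical
```

For example, a drop in RSS from 100 to 80 obtained with two extra parameters on 103 observations gives F = (20/2)/(80/100) = 12.5.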



5 A Comparison with Standard GAM 

Our procedure for variable selection in GAM was applied to the
Bupa Liver Disorders data set, contributed to the UCI Machine
Learning Repository (Blake and Merz, 1998). The goal was to
classify 345 individuals on the basis of two possible levels of alcohol
consumption. The response variable is dichotomous and
the initial error rate is 0.516, corresponding to the proportion of
people presenting high alcohol consumption with respect to those
classified as regular drinkers. The classification problem involves
six predictors: one concerning the number of half-pint equivalents
of alcoholic beverages drunk per day (drinks), the others consisting
of blood tests which are thought to be sensitive to liver
disorders that might arise from excessive alcohol consumption,
namely the mean corpuscular volume (mcv), the alkaline phosphatase
(alkphos), the alanine aminotransferase (sgpt), the aspartate
aminotransferase (sgot) and the gamma-glutamyl transpeptidase
(gammagt).

We use the step.gam approach of Hastie (1992) as a benchmark
for our methodology and compare the respective results. In the
step.gam approach, model selection is made by stepping through
arbitrary models in a prespecified path, that is, once the user
has explicitly defined an ordered regimen of candidate smoothers for
each predictor to be included in the model. The selected model
is the one minimizing a modified Akaike Information Criterion
(AIC).

In our experiment, all the predictors were included in the GAM
model in the variable selection step of both approaches (values
of the scoring functions are not shown here for the sake of
brevity). We compare the results of the final models,
namely the estimation procedures following the AIC criterion and
the Smoothing Score algorithm, in terms of the significance of the
improvement in the goodness of fit provided by each predictor in
the standard backfitting with respect to our modified backfitting.
The results of the Anova-like F-test for the two approaches are
summarized in tab. 3. Our approach outperforms standard GAM on
these data: the error rate provided by our approach
based on bagging and smoother scoring is lower than the one
produced by standard GAM for all the considered
models, that is, each time a further predictor is included in the
model. The final error rate is 0.201 in our case, whereas
it is 0.370 for the standard GAM. Moreover, when a predictor
is added to our model, its contribution to the goodness of
fit, expressed by the relative decrease in the Residual
Sum of Squares, is always significant (the p-values are always lower
than 0.05). This is not the case for the standard GAM, where the
p-value for alkphos (the third model in tab. 3) is 0.08.



Table 3. A comparison between the performance of the two different approaches.

Standard GAM
Model             n. of parameters  Residual Deviance  F      Prob.  Error Rate
intercept         1                 449.4              -      -      0.516
lowess(mcv)       4                 458.7              2.84   0.02   0.504
alkphos           1                 455.7              3.07   0.08   0.501
spline(sgpt)      4                 440.2              4.15   0.00   0.484
sgot              1                 410.2              33.22  0.00   0.484
spline(gammagt)   4                 355.7              13.96  0.00   0.391
spline(drinks)    4                 337.8              4.73   0.00   0.370

GAM based on bagging and scoring
Model             n. of parameters  Residual Deviance  F      Prob.  Error Rate
intercept         1                 449.4              -      -      0.516
lowess(drinks)    4                 439.6              5.24   0.00   0.406
lowess(sgot)      4                 418.2              3.45   0.00   0.365
alkphos           1                 402.6              2.53   0.02   0.319
lowess(sgpt)      4                 377.1              6.16   0.00   0.264
spline(gammagt)   3                 331.0              14.26  0.00   0.209
lowess(mcv)       4                 316.0              3.87   0.00   0.201



6 Concluding Remarks 

According to Hastie et al. (2001), GAMs assume a structured form
for the unknown regression function, and by doing so they finesse
the curse of dimensionality. Of course, they pay the possible price
of misspecifying the model, and so in each case there is a tradeoff
that has to be made. The proposed procedure for variable selection
in GAMs tries to handle these problems.

Our approach uses the bagging procedure introduced in the framework
of decision trees by Breiman (1996), who demonstrated how
this procedure efficiently reduces the variance of regression predictors
while leaving the bias unchanged. In our case, bagging is
used as a tool to improve the accuracy of the selection of the set of
most powerful predictors (with the associated smoothers); in practice,
we evaluate the sensitivity of each variable and smoother
selection with respect to small perturbations induced in the data.
We have shown that our automatic procedure for variable and
smoother selection in GAM improves the identification of the correct
model and, consequently, the model fit.

The theoretical motivation underlying this procedure is that
the identification of the correct model is made in the
first iteration of the backfitting algorithm. In this way, the identification
and the estimation steps are not separated, as opposed to
the step.gam approach. Evaluating the predictive accuracy
of each candidate predictor on the partial residuals of the
models also helps to reduce the problems of multicollinearity and
concurvity that typically affect model fitting.

In the application, we achieved encouraging results in the case 
of a categorical response, but the Smoothing Score algorithm can 
be applied straightforwardly in the numerical response case. In 
general, the bagging and smoothing score approach can be fruit- 
fully used when dealing with complex data structures and in the 
presence of a lot of predictors. 

A wide discussion with some experiments about the effectiveness 
of the proposed scoring function for GAM-MM can be found in 
Conversano (2002). 

Future work will extend this approach to the case of a
multidimensional fit, namely when a single smoothing function is
used to jointly model more than one predictor.



Acknowledgements. The author is grateful to the anonymous referees
and to Prof. Roberta Siciliano of the University of Naples Federico
II for her insightful comments and remarks. The research was
supported by MURST funds 1999 (awarded to R. Siciliano) and
MURST funds 2001 (awarded to F. Mola).







References 

BREIMAN, L. (1996). Bagging predictors, Machine Learning, 24, 123-140.

BLAKE, C.L. and MERZ, C.J. (1998). UCI repository of machine learning
databases, http://www.ics.uci.edu.

CONVERSANO, C. (2002). Bagged Mixtures of Classifiers using Model Scoring 
Criteria, Journal of Pattern Analysis and Applications , 5, 4, 351-362. 

CONVERSANO, C. (2001). Generalized Additive Multi-Mixture Models: Scoring 
Models and Predictors, In: Klein, B. and Korsholm, L. (eds.), New Trends 
in Statistical Modelling, Proceedings of the 16th International Workshop on 
Statistical Modelling , University of Southern Denmark, 103-110. 

CONVERSANO, C., SICILIANO, R. and MOLA, F. (2002). Generalized Additive 
Multi-Mixture Models for Data Mining, Computational Statistics and Data 
Analysis , 38, 4, 487-500, special issue on Nonlinear Methods and Data Mining. 

HASTIE, T., FRIEDMAN, J. and TIBSHIRANI, R. (2001). Elements of Statistical 
Learning , Springer, New York. 

HASTIE, T. and TIBSHIRANI, R. (1990). Generalized Additive Models , Chapman 
& Hall, London. 

HASTIE, T. (1992). Generalized Additive Models, In: Chambers, J. M. and Hastie,
T. (eds.): Statistical Models in S, Wadsworth & Brooks/Cole Computer Science
Series, Pacific Grove, California.

SCHIMEK, M.G. (2000). Additive and Generalized Additive Models, In: Schimek 
M.G. (Ed.), Smoothing and Regression , John Wiley & Sons, New York. 




Bootstrap Variables Selection in 
Neural Network Regression Models 



Francesco Giordano, Michele La Rocca, and Cira Perna 



Department of Economics and Statistics, University of Salerno 
Via Ponte Don Melillo, 84084 Fisciano (SA), Italy. 
e-mail: [giordano,larocca,perna]@unisa.it



Abstract. In this paper we consider the problem of variables selection in a non 
linear regression model with dependent errors. In this framework, we discuss the use 
of some measures for the variables relevance to the neural network model and we 
propose the use of the moving block bootstrap technique to estimate the variability 
of these measures. The performance of the procedure is evaluated by a small Monte 
Carlo experiment which shows how the proposed approach determines a correct 
ranking among relevant and irrelevant variables. 



1 Introduction 

Diagnostic and specification tests for model selection form the 
basis of a broadly accepted and coherent modelling approach, en- 
suring that the estimated model is a good approximation of the 
data generating process. In a regression framework, a key issue 
is variables selection, to avoid omission of relevant variables or 
inclusion of irrelevant ones. When using neural networks to estimate
the regression function, testing whether the weights are
significantly different from zero could be misleading since different
network topologies can lead to similar approximation accuracy.
Therefore, variable selection cannot be addressed by focusing
on single weights since, due to the black-box nature of the network,
it is difficult if not impossible to interpret them. Moreover, this
approach can be inadequate in testing the overall significance of 
an explanatory variable and does not allow an understanding of 
its effect on the prediction of the model. From a statistical stand- 
point, it seems much more helpful to introduce a measure for the 
variable relevance and an estimate of its standard error in order 
to derive a formal statistical test.

In this paper we propose and discuss the use of the moving block
bootstrap to estimate the variability of some relevance measures
for variable selection in neural network regression models with
dependent errors. The paper is organized as follows. In the next
section we briefly review artificial neural networks and in section
3 we focus on relevance measures and on the bootstrap procedure
we propose. In section 4 we report the results of a Monte Carlo
simulation experiment, and the last section contains some concluding
remarks.

2 Neural Network Regression Models 

Let $\{Y_t, t = 1, \ldots, T\}$ be a time series modelled as $Y_t = f(x_t) + Z_t$,
where $f$ is a non linear continuous function, $x_t = (x_{1t}, \ldots, x_{dt})$
is a vector of d non stochastic explanatory variables defined on
$M \subset \mathbb{R}^d$, and $Z_t$ are (possibly) dependent random variables with
zero mean. The function $f$ can be approximated by a single hidden
layer feedforward network of the form:

$g(x_t; \theta) = \sum_{k=1}^{m} c_k \phi\left(\sum_{j=1}^{d} a_{kj} x_{jt} + a_{k0}\right) + c_0$

where $\theta = (c_0, c_1, \ldots, c_m, a'_1, \ldots, a'_m)$ with $a'_k = (a_{k0}, a_{k1}, \ldots, a_{kd})$; $c_k$,
$k = 1, \ldots, m$, is the weight of the link between the k-th neuron in
the hidden layer and the output; $a_{kj}$ is the weight of the connection
between the j-th input neuron and the k-th neuron in the hidden
layer; $a_{k0}$ and $c_0$ are respectively the bias terms of the hidden
neurons and of the output; $\phi(\cdot)$ is the logistic function.
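The network function above can be written compactly in matrix form. The following sketch only evaluates the function for given weights (it does not estimate them); the array layout and the function names are assumptions made for the example.

```python
import numpy as np

def logistic(u):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-u))

def network(x, c0, c, a0, A):
    """Single hidden layer feedforward network.

    x:  (T, d) explanatory variables, one row per time point
    c0: scalar output bias
    c:  (m,) weights from the hidden neurons to the output
    a0: (m,) hidden-neuron biases
    A:  (m, d) weights; A[k, j] links input j to hidden neuron k
    Returns the (T,) vector of network outputs.
    """
    hidden = logistic(x @ A.T + a0)  # (T, m) hidden layer activations
    return hidden @ c + c0
```

With all input weights and hidden biases set to zero, every hidden neuron outputs 0.5, so the network returns c0 plus half the sum of the output weights, which gives a quick sanity check of the layout.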

Neural network models can approximate any $L^2$ function and
its derivatives to any level of accuracy (Cybenko, 1989; Hornik et
al., 1990) and they become useful in high dimensional regression
problems by looking for low dimensional decompositions (Barron,
1991). They are highly flexible and provide predictions with
a relatively small bias. These good properties make neural networks
natural candidates for modelling multivariate data, but the
choice of a proper topology for a given problem is still an open
question. This requires choosing both an appropriate number of
hidden units and a suitable set of explanatory variables and, as a
consequence, the connections thereof.

The most popular approaches used so far are regularization
and pruning (Reed, 1993). In the first method, the network weights
are chosen such that they minimize the network error function
plus a penalty term for the network complexity. With
pruning methods, the aim is to identify those parameters that
do not significantly contribute to the overall network performance.
In both cases different topologies are chosen according to the single
weights.

Although these methods may lead to satisfactory results, looking
at the single weights can be misleading. This approach does
not give any information on the most "significant" variables, which
is useful in any model building strategy, and different topologies
can achieve the same approximation accuracy. Therefore, the
choice of the network may end up being based only on complexity and
not on model plausibility.

All these techniques lack a statistical perspective and lean
much more towards a computational standpoint. Instead, it
would be of some interest to look at the choice of the network
topology from a statistical perspective, including this problem in the
classical model selection approach (see Ripley, 1995, and the references
therein). Focusing on diagnostic tests, Anders and Korn
(1999) propose a complete strategy based on a sequence of Lagrange
multiplier tests on sets of weights. Although the statistical
perspective is very strong, again, the role of the variables is
not stressed.

To emphasize the role of the variables in the choice of the
neural network topology, a different strategy can be employed.
Generally a desirable neural network contains as few hidden units
as necessary for a good approximation of the true function, taking
into account the trade-off between estimation bias and variability.
As a consequence, once the hidden layer size has been fixed by using
some asymptotic results (see Perna and Giordano, 2001) or some
alternative ad-hoc procedures (Faraway and Chatfield, 1998), attention
can be focused on model identification in a classical sense:
the selection of the explanatory variables, in order to remove the
"irrelevant" ones.

The approach we consider here is based on the introduction
of some measures of variable relevance which also, in an
exploratory view, allow a ranking of the available explanatory
variables. Following Refenes and Zapranis (1999) this involves
(i) quantifying a "relevance" measure, (ii) estimating the sampling
variability of this measure and (iii) testing the hypothesis
of insignificance. In this paper we focus on the use of resampling
techniques to estimate the standard error of some relevance measures.
This choice is necessary in a neural network framework,
where the use of asymptotic results, although possible in principle,
soon becomes very difficult and almost impractical in real problems.
Of course, when dealing with dependent data, the residual bootstrap
is not consistent and some alternative bootstrap schemes should
be considered.

3 Relevance Measures and the Moving Block 
Bootstrap for Variables Selection 

In a linear regression model the relevance of a variable is
measured by its coefficient, which is also the magnitude of the
partial derivative of the dependent variable with respect to the
variable itself. So, in this setup, testing whether this
coefficient is zero is equivalent to testing the hypothesis that
the variable is not relevant for the model. When dealing with
nonlinear functions, the partial derivative is not a constant but
varies over the range of the independent variable $X_j$.
Therefore, some kind of synthesis of the partial derivative over
the range of $X_j$ is needed.

A number of criteria, each measuring different aspects of
relevance, can be used. The most popular one is the mean
derivative, $M_j = \mathrm{Mean}(D_{tj})$, where
$D_{tj} = \partial Y_t / \partial x_{tj}$ is the change in
$Y_t = g(x_t; \theta)$ for a very small perturbation in $X_j$.
Alternatively, to avoid cancellations between negative and
positive values, one can use the mean absolute derivative
$MA_j = \mathrm{Mean}(|D_{tj}|)$ or the weighted quadratic mean
$QM_j = \sqrt{\mathrm{Mean}(D_{tj}^2)} \times \sigma_x / \sigma_z$,
where $\sigma_x$ is the standard deviation of the independent
variable and $\sigma_z$ is the standard deviation of the
residuals. In some applications, such as financial studies, it is
also important to detect whether $Y_t$ is sensitive to the
independent variable $X_j$ only in a small percentage of the range
of the input variable. So, the indexes
$MaxAD_j = \max_t(|D_{tj}|)$, $MaxD_j = \max_t(D_{tj})$,
$MinAD_j = \min_t(|D_{tj}|)$ and $MinD_j = \min_t(D_{tj})$ can
also be considered.
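
As a minimal sketch, the seven relevance indexes could be computed from a vector of derivatives as follows. The function name `relevance_measures` and its arguments are illustrative assumptions, not part of the original procedure; the derivatives are taken as given (for instance obtained numerically from a fitted network), and sample standard deviations are assumed for the QM weights.

```python
import numpy as np


def relevance_measures(derivs, x, residuals):
    """Relevance indexes for one input variable (illustrative sketch).

    derivs    : values D_tj = dY_t/dx_tj, t = 1..T (assumed already computed)
    x         : observed values of the input variable X_j
    residuals : residuals of the fitted model
    """
    d = np.asarray(derivs, dtype=float)
    abs_d = np.abs(d)
    # Weight sigma_x / sigma_z used by the weighted quadratic mean.
    scale = np.std(x, ddof=1) / np.std(residuals, ddof=1)
    return {
        "M": d.mean(),                           # mean derivative
        "MA": abs_d.mean(),                      # mean absolute derivative
        "QM": np.sqrt(np.mean(d ** 2)) * scale,  # weighted quadratic mean
        "MaxD": d.max(),
        "MaxAD": abs_d.max(),
        "MinD": d.min(),
        "MinAD": abs_d.min(),
    }
```

Note how M can be zero because of cancellations while MA and QM remain clearly positive, which is the motivation given above for the alternative indexes.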

Since all the relevance measures are functionals of the
parameter vector, the sampling variability of their estimates
could be computed analytically by a quite tedious process based on
some asymptotic arguments. Alternatively, one can use a feasible
estimator based on stochastic simulation, such as bootstrapping.

In the context of neural network regression models with iid
errors, the residual bootstrap technique has been pursued in
Tibshirani (1995) and Refenes and Zapranis (1999). When dealing
with dependent data, this approach is not consistent and some
modifications of the original procedure are needed in order to
preserve the dependence structure of the original data in the
bootstrap samples. To this end, we propose to use a blockwise
resampling scheme, the Moving Block Bootstrap (MBB), which gives
consistent results under very general and minimal conditions and
is robust against misspecified models (Künsch, 1989; Politis and
Romano, 1992, inter alia). This is particularly useful in neural
networks, where the specification of an adequate neural network
topology can be difficult.

Moreover, even if the "best" topology is selected, neural
networks are "atheoretical" and are employed because of the lack
of knowledge about the form of the data generating process.
Therefore, it is reasonable to consider a neural network as an
approximation to an underlying model and analyze it as being
misspecified in the sense of White (1994). Finally, estimation
procedures for neural networks do not produce unique optima: very
different solutions correspond to very similar values of the loss
function, so the model cannot be considered globally identified.

The MBB algorithm can be adapted to possibly non stationary
time series, in a neural network context, as follows.

Step 1. Compute the neural network estimates $g(x_t; \hat{\theta})$
for $t = 1, \ldots, T$, the residuals
$\hat{Z}_t = Y_t - g(x_t; \hat{\theta})$ and the centered residuals
$\tilde{Z}_t = \hat{Z}_t - \sum_{t} \hat{Z}_t / T$.

Step 2. Fix $L < T$ and form blocks of length L of consecutive
observations from the original data; the bootstrap sample is
$Z^*_{(j-1)L+t} = \tilde{Z}_{S_j+t}$, $1 \le j \le b$,
$1 \le t \le L$, where $b = \lceil T/L \rceil$, with
$\lceil x \rceil$ denoting the smallest integer greater than or
equal to x. Let $S_1, S_2, \ldots, S_b$ be iid uniform on
$\{0, 1, \ldots, T - L\}$. If T is not a multiple of L, only
$T + L - bL$ observations from the last block are used. The block
length can be chosen as $L = [C \cdot T^A]$, where $C > 0$ and
$A \in (0, 1)$, with $[x]$ denoting the closest integer to x. When
estimating a standard error, $A = 1/3$ and the constant C can be
derived numerically (Hall et al., 1995) or by fixing it, as usual,
at a value between 1/2 and 2.

Step 3. From the bootstrap replicate $\{Z^*_1, \ldots, Z^*_T\}$,
generate the bootstrap replicate of the original series as
$Y^*_t = g(x_t; \hat{\theta}) + Z^*_t$ with $t = 1, \ldots, T$ and
compute the bootstrap estimates of the neural network parameters,
$\hat{\theta}^*$.

Step 4. Given one of the relevance measures introduced before,
say $\hat{\delta} = \delta(\hat{\theta})$, derive the bootstrap
analogue, $\hat{\delta}^* = \delta(\hat{\theta}^*)$.

Then, the variability of $\hat{\delta}$ is estimated by the
bootstrap variance $\mathrm{var}^*[\hat{\delta}^*]$, where
$\mathrm{var}^*[\cdot]$ denotes the variance conditional on the
observed data $(Y_t, x_t)$, $t = 1, \ldots, T$. As usual, the
bootstrap variance can be estimated by a Monte Carlo simulation.
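
The resampling in Step 2 and the Monte Carlo estimate of the bootstrap standard error can be sketched as follows. This is only an illustration under stated assumptions: the names `moving_block_bootstrap` and `bootstrap_std_error` are hypothetical, and the `statistic` callable stands in for Steps 3 and 4 (rebuilding the series, refitting the network and evaluating the relevance measure), which are not implemented here.

```python
import numpy as np


def moving_block_bootstrap(z, block_length, rng):
    """Draw one MBB replicate of the centered residuals (sketch of Step 2)."""
    z = np.asarray(z, dtype=float)
    T, L = z.size, int(block_length)
    if not 0 < L < T:
        raise ValueError("block_length must satisfy 0 < L < T")
    z_centered = z - z.mean()
    b = -(-T // L)                               # ceil(T / L) blocks
    starts = rng.integers(0, T - L + 1, size=b)  # S_j uniform on {0, ..., T-L}
    # Concatenate the b blocks and keep only the first T observations,
    # so that T + L - b*L values of the last block are used.
    idx = (starts[:, None] + np.arange(L)[None, :]).ravel()[:T]
    return z_centered[idx], starts


def bootstrap_std_error(z, block_length, statistic, n_boot=50, seed=0):
    """Monte Carlo estimate of the bootstrap standard error of `statistic`."""
    rng = np.random.default_rng(seed)
    reps = [statistic(moving_block_bootstrap(z, block_length, rng)[0])
            for _ in range(n_boot)]
    return float(np.std(reps, ddof=1))
```

In an application along the lines of the text, `statistic` would map a bootstrap residual series to the relevance measure recomputed on the refitted network.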

4 Monte Carlo Results 

A Monte Carlo experiment has been performed to illustrate the
process of variable selection in the context of neural model
identification. The data have been generated according to Wahba's
function as

$Y_t = 4.26\,(e^{-x} - 4e^{-2x} + 3e^{-3x}) + Z_t$

with $x \in [0, 2.5]$ and three different specifications for the
error terms $Z_t$: a white noise; an ARMA(1,1), specified as
$Z_t = -0.8 Z_{t-1} - 0.5 \varepsilon_{t-1} + \varepsilon_t$ with
the innovations $\varepsilon_t$ distributed as a Student-t with 6
degrees of freedom; an EXPAR(2) model defined as



$Z_t = (0.5 + 0.9 \exp(-Z_{t-1}^2))\, Z_{t-1} - (0.8 - 1.8 \exp(-Z_{t-1}^2))\, Z_{t-2} + \varepsilon_t$

with the innovations $\varepsilon_t$ distributed as standard
normal. The first model (M1) reproduces the usual regression model
with iid errors; the second model (M2) considers a linear error
process with non-Gaussian innovations with a leptokurtic
distribution, a typical structure in some popular applications of
neural networks, such as those to financial data. Finally, the
third model (M3) allows both deterministic and stochastic
nonlinear components to be considered.
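
A rough sketch of how series from the three designs could be simulated is given below. Several choices are assumptions made only for illustration and are not stated in the text: x is drawn uniformly on [0, 2.5], the white noise is standard normal, a burn-in period is discarded, and both exponential terms of the EXPAR(2) model are taken to be exp(-Z_{t-1}^2). The function names are hypothetical.

```python
import numpy as np


def wahba(x):
    """Wahba's function as used in the Monte Carlo design."""
    return 4.26 * (np.exp(-x) - 4.0 * np.exp(-2.0 * x) + 3.0 * np.exp(-3.0 * x))


def simulate_errors(model, T, rng, burn_in=200):
    """Simulate the error process for M1, M2 or M3 (burn_in is an assumption)."""
    n = T + burn_in
    z = np.zeros(n)
    if model == "M1":                      # white noise (unit variance assumed)
        z = rng.standard_normal(n)
    elif model == "M2":                    # ARMA(1,1), Student-t(6) innovations
        e = rng.standard_t(6, size=n)
        for t in range(1, n):
            z[t] = -0.8 * z[t - 1] - 0.5 * e[t - 1] + e[t]
    elif model == "M3":                    # EXPAR(2), standard normal innovations
        e = rng.standard_normal(n)
        for t in range(2, n):
            w = np.exp(-z[t - 1] ** 2)     # assumed form of the exponential term
            z[t] = (0.5 + 0.9 * w) * z[t - 1] - (0.8 - 1.8 * w) * z[t - 2] + e[t]
    else:
        raise ValueError("model must be 'M1', 'M2' or 'M3'")
    return z[burn_in:]


def simulate_series(model, T, rng):
    """Return (x, Y) with x uniform on [0, 2.5] (sampling scheme assumed)."""
    x = rng.uniform(0.0, 2.5, size=T)
    return x, wahba(x) + simulate_errors(model, T, rng)
```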

The simulations are based on time series of length T = 400,
100 Monte Carlo runs and 50 bootstrap replicates. We limited the
hidden layer size m to the set {3, 6}, since Wahba's function
can be approximated very well with just a few neurons and values
greater than 6 give serious overfitting problems. The block length
L varies in the set {4, 6, 12} in order to study its impact on the
performance of the procedure. This parameter is critical in most
applications since, if the block of consecutive observations is
not long enough, the dependence in the original series may not be
reasonably preserved in the resampled ones.

We assume that the data generating process is unknown both
in the functional form and in the relevant variables, and we try
to reproduce it by $Y = \psi(x, w) + z$, where w is a white noise.
The aim of the analysis is to show that the model identification
procedure will correctly identify x as the most relevant variable.
The sensitivity of the dependent variable to the two explanatory
variables, x and w, is measured by the seven different relevance
indexes defined in the previous section. They have been
studentized as
$t = \hat{\delta} / \sqrt{\mathrm{var}^*[\hat{\delta}^*]}$, so
that the larger this value is, the higher is the probability that
the variable is relevant. The means over the 100 Monte Carlo runs
are reported in the following tables.

The relevant variable is correctly identified as x by all the
relevance measures, with a clear separation between the two
variables, and an unambiguous ranking emerges. Even in the white
noise case, where no dependence among errors is present, the MBB
works satisfactorily. In addition, for reasonable choices, the
block length does not seem so critical.

Table 1. Variable significance statistics: sensitivity of the dependent variable to
the explanatory variables, x and w. White noise error process.

m   L   Variable        M       MA      QM    MaxD   MaxAD    MinD   MinAD
3   4   x          10.924   14.969   8.377   7.444   5.462  -1.757   0.901
3   4   w           0.042    1.683   1.589   1.011   1.195  -1.043   0.389
3   6   x           9.356   13.303   7.228   6.628   4.485  -1.690   0.935
3   6   w          -0.151    1.536   1.320   0.857   0.950  -0.811   0.281
3   12  x           7.837   11.822   6.532   6.823   4.163  -1.762   1.012
3   12  w           0.150    1.974   1.758   1.204   1.390  -1.239   0.406
6   4   x           2.822   13.541   7.150   6.778   4.297  -4.077   0.717
6   4   w           0.082    2.226   1.970   1.333   1.384  -1.275   0.548
6   6   x           2.869   12.653   6.774   6.958   4.230  -3.786   0.620
6   6   w          -0.122    2.352   2.106   1.432   1.454  -1.197   0.478
6   12  x           2.743   12.157   6.090   6.346   3.803  -3.303   0.790
6   12  w           0.054    2.239   1.942   1.223   1.306  -1.174   0.554

Table 2. Variable significance statistics: sensitivity of the dependent variable to the
explanatory variables, x and w. ARMA error process with Student-t innovations.

m   L   Variable        M       MA      QM    MaxD   MaxAD    MinD   MinAD
3   4   x          15.306   25.474  10.820  12.890   9.385  -4.688   1.218
3   4   w           0.019    1.722   1.549   1.087   1.280  -1.098   0.355
3   6   x          12.543   25.225   9.867  14.128   9.571  -6.047   1.233
3   6   w          -0.054    1.853   1.736   1.058   1.480  -1.331   0.353
3   12  x          11.199   25.109   8.818  14.045   8.877  -5.736   1.011
3   12  w          -0.001    1.887   1.707   1.379   2.610  -2.218   0.297
6   4   x           7.111   29.038  11.308   7.665   8.217  -7.542   0.817
6   4   w           0.084    2.141   1.845   1.162   1.257  -1.137   0.531
6   6   x           5.783   30.771  10.306   8.209   8.708  -8.080   1.005
6   6   w           0.052    2.283   1.874   1.448   1.563  -1.225   0.548
6   12  x           6.099   29.249   9.551   7.238   8.502  -8.226   0.960
6   12  w          -0.131    2.100   1.720   1.362   1.304  -1.218   0.543

The size of the hidden layer seems to have a greater impact on
the value of the relevance measures. For m = 6 overfitting problems
could arise and the discrimination between the relevant variable
and the irrelevant one may be less clear than in the case of
m = 3. In our simulations the mean values of the statistics
considered for the two variables become closer but, in any case,
the ranking is preserved.

Table 3. Variable significance statistics: sensitivity of the dependent variable to
the explanatory variables, x and w. EXPAR error process.

m   L   Variable        M       MA      QM    MaxD   MaxAD    MinD   MinAD
3   4   x          12.064   17.051   9.497   8.928   6.145  -2.111   1.068
3   4   w          -0.085    1.759   1.595   1.108   1.177  -0.994   0.318
3   6   x          10.091   16.260   9.114   9.481   6.105  -2.951   0.905
3   6   w           0.051    1.827   1.676   1.114   1.338  -1.100   0.333
3   12  x           8.863   14.662   8.337   8.959   5.288  -2.543   0.855
3   12  w           0.061    1.760   1.589   1.025   1.199  -1.023   0.269
6   4   x           3.856   17.499   9.767   7.108   5.789  -5.399   0.726
6   4   w           0.140    2.046   1.747   1.132   1.181  -1.018   0.573
6   6   x           3.530   16.563   8.965   7.949   5.399  -5.100   0.813
6   6   w          -0.013    2.404   2.100   1.349   1.494  -1.331   0.579
6   12  x           3.419   16.934   9.161   7.263   5.340  -5.030   0.662
6   12  w           0.191    2.313   1.980   1.264   1.330  -1.168   0.607

Very similar results, not reported here, have been obtained
with more complex experimental designs involving more than two
explanatory, possibly correlated, variables. Again, the test is
able to distinguish between relevant and irrelevant inputs.

To get a first impression of the distribution of the relevance
measures, we focus our attention on the measure
$M_j = \mathrm{Mean}(D_{tj})$, since its simple structure suggests
a possibly asymptotic Gaussian distribution. To assess departures
from this model, Figure 1 reports the normal Q-Q plots with
envelope (Atkinson, 1985) of the sampling distribution of the
relevance measure when m = 3, for the three models and for the
different values of L. To produce the plots, we generated 19
samples of the same size from a normal distribution and scaled all
samples to mean 0 and variance 1, in order to remove dependence on
location and scale parameters. Then, each sample is sorted and, for
each order statistic, the maximum and the minimum values over the
generated samples form the upper and the lower envelopes.
Afterwards, the envelopes are plotted on the Q-Q plot of the scaled
sampling distribution and form a guide to what constitutes serious
deviations from the expected behavior under normality.

Fig. 1. Q-Q normal plots with envelope of the sampling distribution of the relevance
measure $M_j = \mathrm{Mean}(D_{tj})$ when m = 3, for the three models and for the
different values of L.
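
The envelope construction described in the text can be sketched as follows. The function names are illustrative assumptions, and plotting is left out; only the simulated bounds and the standardized order statistics are computed.

```python
import numpy as np


def normal_envelope(n, n_sim=19, seed=0):
    """Lower and upper envelopes from n_sim standardized normal samples."""
    rng = np.random.default_rng(seed)
    sims = rng.standard_normal((n_sim, n))
    # Scale every simulated sample to mean 0 and variance 1.
    sims = (sims - sims.mean(axis=1, keepdims=True)) / sims.std(
        axis=1, ddof=1, keepdims=True)
    sims.sort(axis=1)                      # order statistics of each sample
    # For each order statistic, take the extremes over the simulated samples.
    return sims.min(axis=0), sims.max(axis=0)


def standardized_order_statistics(values):
    """Sorted, standardized values to be compared with the envelopes."""
    v = np.asarray(values, dtype=float)
    return np.sort((v - v.mean()) / v.std(ddof=1))
```

Points of `standardized_order_statistics` falling outside the band returned by `normal_envelope` would indicate departures from normality of the kind the plot is meant to reveal.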

Looking at Figure 1, it is quite clear that the Gaussian
distribution is a reasonable approximation for the distribution of
the test statistic, so it can be used as a guideline for
significance testing. Therefore, as a rule of thumb, values of the
statistic greater than 2 can be considered significant at the
usual level of 5% and the corresponding variable is assumed to be
"relevant".

5 Concluding Remarks 

Neural networks have been considered mainly as "black boxes" and
"data-mining" tools, without particular emphasis on the
statistical issues related to model identification and
interpretability. In this paper, we have focused on a methodology
useful to identify explanatory variables in a regression framework
with dependent errors. In particular, we have proposed the use of
the moving block bootstrap to estimate the standard error of some
relevance measures. The approach can be used as an exploratory
tool to select a proper set of inputs for neural network models
and it seems effective in producing a correct ranking of relevant
and irrelevant variables.

The results of a small Monte Carlo study are quite encouraging 
but, to get a deeper insight, several different aspects should be 
further explored. 

Firstly, if interest is in constructing significance test proce- 
dures based on the relevance measures, the sampling distribution 
of the test statistics under the null hypothesis (that the chosen 
variable is not relevant) is needed. For the mean derivative rele- 
vance measure case, a standard normal approximation seems to 
be a reasonable choice. In the general case, the derivation of the 
sampling distributions for model adequacy statistics is still an 
open issue. 

Secondly, several alternative nonparametric resampling schemes 
could be employed to get a measure of sampling variability or 
to construct approximate testing procedures. The performance 
of these techniques should be evaluated and compared with the 
blockwise bootstrap proposed here.

Abstract. One of the most important and broadly accepted solutions for the
centre selection problem in radial basis function networks is due to Moody and Darken
(1989), who propose to cluster the input vectors via the k-means clustering method.
In this paper, an alternative robust solution, based on the spatial median, is
proposed and compared also to the k-medoids method (Kaufman and Rousseeuw,
1990). Some simulation studies in the context of classification problems show that,
when outlying data are present, the solution we propose improves the network
performance and allows a more parsimonious network to be defined.



1 Introduction 

Artificial neural networks are a particularly rich class of non linear 
models developed by cognitive scientists, often used in regression, 
classification and time series prediction. This modelling approach 
is very useful for many practical problems when the underlying 
laws governing the systems from which the data are generated are 
unknown. In this context, radial basis function networks (RBF 
networks) provide an attractive alternative to other feedforward 
networks because of their faster learning capability. 

RBF networks can be represented as single-hidden-layer
feedforward neural networks, in which the activation functions of
the hidden layer units are radially symmetric (radial basis
functions), often Gaussians, each giving a significant response
only in a local region of the p-dimensional input space (see
Bishop, 1995 or Ripley, 1996 for more details). The transformation
performed by the j-th output unit can be represented as a weighted
sum of k basis functions, each computed by a hidden unit, as
follows

$g_j(x) = \alpha_{j0} + \sum_{h=1}^{k} \alpha_{jh}\, \phi_h(\|x - \mu_h\|)$   (1)

where $x \in R^p$ is an input vector, $\phi_h$ is the h-th basis
function with centre vector $\mu_h \in R^p$, $\alpha_{jh}$ is the
weight of the connection between the h-th hidden unit and the j-th
output one, and $\|\cdot\|$ denotes the Euclidean norm. The
constant term $\alpha_{j0}$ is added to compensate for the
difference between the average value over the data set of the
basis function activations and the corresponding target values.

In the Gaussian case, the activation function of the h-th
hidden unit is

$\phi_h(\|x - \mu_h\|) = \exp\left(-\frac{1}{2\sigma_h^2} \|x - \mu_h\|^2\right)$   (2)

It is defined by the centre $\mu_h$ and by the width or scale
parameter $\sigma_h \in R^+$, which influences the degree of
smoothness of the output function.

For fixed centres and fixed widths, the coefficients of the linear
combination (1) can be estimated by least squares. This suggests
estimating the network parameters by a two-step procedure: in the
first step the location and scale parameters of the basis
functions are determined; in the second the weights are estimated
according to the least squares optimality criterion.
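
The second step of this procedure can be sketched as an ordinary least squares fit on the basis function activations. The sketch assumes that the centres and widths have already been determined in the first step and that the Gaussian basis function has the form exp(-||x - mu_h||^2 / (2 sigma_h^2)); the function names are hypothetical.

```python
import numpy as np


def rbf_design_matrix(X, centres, widths):
    """Constant column plus Gaussian activations for every input row."""
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    phi = np.exp(-sq_dist / (2.0 * widths[None, :] ** 2))
    # The leading column of ones carries the constant term alpha_j0.
    return np.hstack([np.ones((X.shape[0], 1)), phi])


def fit_output_weights(X, targets, centres, widths):
    """Least squares estimate of the output weights for fixed centres/widths."""
    phi = rbf_design_matrix(X, centres, widths)
    weights, *_ = np.linalg.lstsq(phi, targets, rcond=None)
    return weights
```

Because the weights enter linearly once the centres and widths are fixed, the whole burden of the nonlinear estimation falls on the first step, which is why the centre locations matter so much.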

Of crucial concern is the first stage of the estimation process 
since the basis function centre locations strongly influence the net- 
work performance. Unless irrelevant variables are present (Pillati, 
2001), the basis function centres should be learned by unsuper- 
vised procedures in order to reflect the data distribution in the 
input space. 

One solution is to locate the centres at a collection of training 
elements, selected by random sampling or via subset selection or 
orthogonal least squares. As an alternative, the use of clustering 
techniques to find a set of centres which more accurately reflects 
the distribution of the data in the input space has been proposed 
(see Bishop, 1995, for a review). 

The theoretical foundations of the clustering based estimation
methods may be found in Poggio and Girosi (1990), who showed
that a gradient descent approach used to update the RBF centres
actually "... makes the centres move towards the majority of the
data, to find the position of the clusters". Therefore, basis
functions with localized responses should be placed in regions of
data concentration.

Moody and Darken (1989) proposed the k-means clustering
algorithm (MacQueen, 1967) to optimally place RBF centres and a
heuristic nearest neighbour method to determine the scale
parameters. They estimated $\mu_h$ (h = 1, ..., k) through the
cluster centroids, each basis function being representative of a
cluster of training data. Unfortunately, this clustering method is
sensitive to outlying data, thus yielding a set of locations which
fails to reflect how the training set is scattered in the input
space.

To overcome this problem, as an alternative to outlier deletion, 
we explore the possibility of robustly estimating the centres of the 
radial basis functions. 

Among robust competitors of the mean vector in estimating
location parameters in the multivariate context, we focus our
attention on the so called medoid (Kaufman and Rousseeuw, 1990),
and on the spatial median (Gower, 1974), both derived as
"prototypes" in robust clustering algorithms. More precisely, we
apply the PAM (Partitioning Around Medoids) algorithm to derive
k medoids and propose a clustering algorithm which selects the
cluster spatial medians as prototypes.

We evaluate the benefits of employing the proposed method
by testing the performance of the spatial median based networks
through some simulation experiments in the context of
classification problems. The results, shown in the following,
highlight that, when outlying data are present, the proposed
solution improves the network performance and allows a more
parsimonious classifier to be defined.



2 Multivariate Location Measures and 
Outlying Data 

The k-means clustering method looks for a partition of the data
set into k nonempty subsets and for k representative points that
minimise the following objective function

$J = \sum_{c=1}^{k} \sum_{i=1}^{n} u_{ci}\, \|x_i - v_c\|^2$   (3)

where $v_c \in R^p$ denotes the representative point or
"prototype" of cluster c, and $u_{ci} \in \{0, 1\}$ indicates
whether $x_i$ is a member of cluster c ($u_{ci} = 1$) or not. The
problem is to minimise J with respect to the sets
$U = \{u_{ci}\}$ and $V = \{v_c\}$ under the following constraints

$\sum_{c=1}^{k} u_{ci} = 1 \quad \forall i,\ i = 1, \ldots, n$   (4)

$0 < \sum_{i=1}^{n} u_{ci} < n \quad \forall c,\ c = 1, \ldots, k$   (5)

The solution may be approximated by iteratively minimising the 
functional J in each set of grouped coordinates: first V is fixed, 
and J is minimised with respect to U ; then, given these par- 
tial solutions, J is minimised with respect to V (Bobrowski and 
Bezdek, 1991). The procedure is repeated until there are no fur- 
ther changes in the grouping. 

Denoting by $D(x_i, v_c)$ the "dissimilarity" between a data
point $x_i$ and a prototype $v_c$ (in (3) $D(x_i, v_c)$ is the
squared Euclidean distance), it may be shown that the solution to
the first optimisation step consists in assigning any unit i to
the cluster $c^*$ if $D(x_i, v_{c^*}) = \min_c D(x_i, v_c)$, i.e.
to the cluster whose prototype is nearest to $x_i$. In the second
step, the prototypes are optimally chosen by minimising the
within-cluster sum of dissimilarities: in (3) this yields the
cluster centroids.
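
One pass of the two optimisation steps for objective (3) can be sketched as follows. The function names are illustrative assumptions; a cluster that receives no points simply keeps its previous prototype, a choice made here only to keep the sketch short.

```python
import numpy as np


def kmeans_objective(X, prototypes, labels):
    """Objective (3): sum of squared Euclidean distances to own prototype."""
    return float(((X - prototypes[labels]) ** 2).sum())


def kmeans_step(X, prototypes):
    """First assign units to the nearest prototype, then update centroids."""
    sq_dist = ((X[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)            # minimise J over U for fixed V
    new_prototypes = prototypes.copy()
    for c in range(prototypes.shape[0]):
        members = X[labels == c]
        if len(members) > 0:                   # empty cluster: keep old prototype
            new_prototypes[c] = members.mean(axis=0)   # minimise J over V
    return new_prototypes, labels
```

Repeating `kmeans_step` until the labels stop changing gives the usual iterative relocation scheme; since the update is a mean, a single remote observation can pull a centroid arbitrarily far.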

Like all variance minimisation techniques, the k-means
clustering algorithm performs poorly when outlying data are
present, because of their effect on the value of the function J to
be minimised.

More precisely, as it yields a set of k joint location least
squares estimators, it tends to isolate "large outliers", which
therefore will be taken as centres of some basis functions. Even if
outlying data may be interesting in exploratory data analysis,
since they may represent some meaningful isolated clusters, they
may play a different role when clustering techniques are used in
the basis function optimisation step. In fact, they may lead to
some basis functions being located in low density regions of the
input space, and not in the bulk of the data, where the system
could benefit from them more. Moreover, even if outliers are not
isolated by the clustering algorithm, they will still affect the
centre estimates, moving some centroids away from the bulk of the
corresponding clusters.

As a result, if outliers are present the k-means algorithm leads
to a set of centres which fails to reflect how the training set is
scattered in the input space. In order to cope with this problem,
a possible approach is to detect and remove outliers before
applying the clustering procedure (Lay and Hwang, 1999). As an
alternative to outlier deletion, we explore the effectiveness of
robust clustering methods to optimally locate RBF centres.

One of the most important contributions on robustness in
cluster analysis is the k-medoids method, implemented in the
program PAM (Partitioning Around Medoids) described in Kaufman
and Rousseeuw (1990).

It could represent an alternative solution to the problem of
selecting a representative collection of training elements as basis
function centres. In fact, the clustering algorithm is based on the
search for k representative objects (called medoids) in the data
set, which should represent various aspects of the structure of the
data. As suggested by the authors, the algorithm is especially
recommended if one is interested in the representative objects
themselves.

This clustering solution is more robust with respect to outliers
because it is based on the minimisation of an objective function
J involving the Euclidean distance or the Manhattan distance
instead of the squared Euclidean distance. Thus, the k-medoid
approach searches for k representative objects $v_c$ in the data
set such that the function



$J = \sum_{c=1}^{k} \sum_{i=1}^{n} u_{ci}\, D(x_i, v_c)$   (6)







Pillati and Calo 



is minimised, where D is the Euclidean distance or the Manhattan
one. In (6) $u_{ci}$ is set equal to 1 if $v_c$ is the medoid
closest to $x_i$. The corresponding k clusters are found by
assigning each remaining object to the nearest representative one.
Thus, the average distance of the representative object to all the
other elements of the same cluster is minimised. For this reason,
such an optimal representative object is called the medoid of its
cluster. However, this leads to outlier isolation in most
situations (Kaufman and Rousseeuw, 1990).

Even if we use the objective function (6) and allow the k
representative objects to be replaced by general p-dimensional
points (Cooper, 1963; Späth, 1980), the k joint location measures
(k medians) we obtain are not robust either. In fact, although
the median is a robust centralisation measure, the selection of k
"joint" medians through this formulation is very unstable: it may
be shown that the introduction of one sufficiently remote value
implies the selection of such a value as one of the representative
points (Cuesta-Albertos et al., 1997).

Figure 1 shows the behaviour of these "joint" location measures,
and the corresponding clustering process, on an artificial
one-dimensional data set with an obvious two-cluster structure.
While in (a) the clusters are perfectly detected and the two
medians reflect the data structure, in (b) the clustering process
is derailed by the presence of only one outlier, which is selected
as one of the two representative points.



[Figure 1, panels (a) and (b): one-dimensional data at -6, ..., -2 and 2, ..., 6, with an additional outlier at 50 in panel (b).]



Fig. 1. Illustration of k-medians clustering (k = 2) in a one-dimensional data set
(Garcia-Escudero et al. 1997). Different symbols (square and dot) denote the
membership to different clusters, and the representative points are black.

3 The k-Spatial Median Estimates of the
RBF Location Parameters

Taking into account that medians are robust competitors of the
mean vector in estimating location parameters in the one-sample
problem, we propose an iterative relocation algorithm, which is
based on a particular notion of multidimensional median.

As an alternative to the exhaustive search involved in the
minimisation of (6), we suggest approximating the solution by
iteratively minimising the objective function in each set of
grouped coordinates, as in the k-means algorithm. As explained in
the following, first U is fixed, and J is minimised with respect
to $\{v_c\}$; then, given these partial solutions, J is minimised
with respect to $\{u_{ci}\}$ by assigning any unit i to the
cluster whose prototype is nearest to $x_i$.

Given U, the prototypes are optimally chosen by minimising
the within-cluster sum of dissimilarities, but a different choice
for $D(x_i, v_c)$ leads to a different set of representative
objects. If $D(x_i, v_c)$ is the Manhattan distance the solution
is given by the cluster median vectors; if $D(x_i, v_c)$ is the
Euclidean distance the second step solution leads to the cluster
spatial medians.

In fact, given a set of n points in $R^p$, $x_1, x_2, \ldots, x_n$,
its spatial median is the vector $\hat{\theta}$ that minimises the
sum of the Euclidean distances to the points, in direct analogy to
the univariate sample median, that is

$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{n} \|x_i - \theta\|$   (7)

We focus our attention on it for several reasons. It is invariant
to rotations of the axes and thus preferable to the non invariant
vector of the marginal medians. A geometrical interpretation shows
that the spatial median is naturally applicable to spherically
symmetric models (Small, 1990). Furthermore, following Vardi and
Zhang (1999), the spatial median of a p-dimensional data cloud
can be computed through a fast algorithm, which is extremely
simple to program and is guaranteed to converge monotonically
to the solution from any starting point in $R^p$.

Like any iterative relocation algorithm, the one we propose
requires an initial starting point. Starting from a random
partition of the training set, and not from a random sample of k
objects, the robustness of the spatial median guarantees that the
extreme values are never selected as prototypes during the
iterations of the procedure.
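
A compact sketch of the iterative relocation scheme with spatial medians as prototypes is given below. It is an illustration under stated assumptions rather than the implementation used in the paper: the spatial median is computed with a plain Weiszfeld-type iteration guarded against division by zero, not with the modified algorithm of Vardi and Zhang (1999), and the function names and the `initial_labels` argument (the starting partition) are hypothetical.

```python
import numpy as np


def spatial_median(X, tol=1e-8, max_iter=1000, eps=1e-12):
    """Approximate the minimiser of (7) by a Weiszfeld-type iteration."""
    X = np.asarray(X, dtype=float)
    theta = X.mean(axis=0)                     # start from the mean vector
    for _ in range(max_iter):
        dist = np.linalg.norm(X - theta, axis=1)
        w = 1.0 / np.maximum(dist, eps)        # guard against zero distances
        new_theta = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


def k_spatial_medians(X, initial_labels, k, max_iter=100):
    """Alternate prototype updates (spatial medians) and reassignments."""
    labels = np.asarray(initial_labels).copy()
    for _ in range(max_iter):
        if any(not np.any(labels == c) for c in range(k)):
            raise ValueError("empty cluster encountered")
        # For fixed U, the prototypes are the cluster spatial medians.
        prototypes = np.vstack([spatial_median(X[labels == c]) for c in range(k)])
        # For fixed V, each unit goes to the nearest prototype.
        dist = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return prototypes, labels
```

With three coincident points and one remote value, `spatial_median` stays at the coincident points while the mean is dragged a quarter of the way towards the remote one, which is the robustness property the algorithm relies on.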

Figure 2 shows the behaviour of the iterative algorithm in
the same situation reported in Figure 1(b). As opposed to
k-means clustering, the proposed algorithm does not reach the
global solution of the minimisation problem, which corresponds
to forcing one median to catch the outlier. It stops in a local
minimum, yielding a set of cluster prototypes which represents
a robust solution to the RBF centre location problem. As can
be seen in Figure 2, the iterative algorithm leads to the same
representative points obtained by the k-means algorithm or by
PAM when no outlier is present (Figure 1(a)).

[Figure 2: the same one-dimensional data, at -6, ..., -2 and 2, ..., 6, with the outlier at 50.]



Fig. 2. Output of the proposed clustering procedure in the artificial data set of 
Figure 1(b). 



4 Simulation Study and Concluding Remarks 

The robustness of the proposed method against outlying
observations has been tested on several data sets in the context
of classification problems. Its performance has been compared to
that of two RBF networks, which differ in the centre estimation
method: the k-means based classifier and the k-medoids based
one.

The estimation of the scale parameters followed a nearest
neighbour based method, as suggested in Moody and Darken
(1989). The procedures have been implemented in S-PLUS,
resorting to the algorithm proposed by Vardi and Zhang (1999) for
the computation of the spatial median and to the S-PLUS function
PAM for the selection of the medoids.

The first simulation experiment deals with a 2-class problem
in 6 dimensions. Class 1 is drawn from a uniform mixture
of a multivariate unit normal with vector mean (a, ..., a) and
a multivariate unit normal with vector mean (-a, ..., -a), with
$a = 2/\sqrt{6}$. Class 2 is generated from a unit normal with vector
mean (a, —a, —a). 

A contaminated setting was simulated by drawing 10% of the 
observations in Class 2 from a distribution which differs from the 
original one only in the standard deviation (10 times the previous 
one). 

Both in the non contaminated situation and in the contami- 
nated one, 50 training sets of 300 observations each were gener- 
ated. The results of these two simulation experiments are illus- 
trated in Table 1 and Table 2, respectively. They show, for each 
classifier, the mean and the standard deviation of the 50 misclas- 
sification error rates evaluated on a test set of 3000 observations. 



Table 1. Mean error rates and standard errors (in brackets) of RBF networks with
different centre estimation methods in the 2-class problem (no outlying observation).

 k     k-means RBF        k-medoids RBF      k-spatial medians RBF
 2     0.3305 (0.0425)    0.3050 (0.0612)    0.3270 (0.0242)
 3     0.1286 (0.0034)    0.1582 (0.0250)    0.1269 (0.0027)
 4     0.1260 (0.0036)    0.1317 (0.0082)    0.1237 (0.0043)
 5     0.1244 (0.0045)    0.1281 (0.0088)    0.1232 (0.0048)
 6     0.1228 (0.0051)    0.1248 (0.0061)    0.1209 (0.0047)
 7     0.1215 (0.0051)    0.1220 (0.0065)    0.1200 (0.0050)
 8     0.1188 (0.0046)    0.1208 (0.0065)    0.1163 (0.0053)
 9     0.1173 (0.0047)    0.1196 (0.0054)    0.1163 (0.0048)
 10    0.1172 (0.0050)    0.1172 (0.0055)    0.1150 (0.0040)

The results show that in the absence of outlying observations all
the classifiers yield similar error rates for the various values of k, even
if the choice of a subset of training elements through PAM seems
to slightly worsen the network performance, except for k = 2, where
the three classifiers perform poorly. Due to the three-modal population
structure, the use of more than three basis functions does
not appreciably reduce the test set error rates for any of the classifiers.

As expected, in the contaminated setting (Table 2) the k-means based
RBF network proves to be derailed by outlying observations, while the
solution we propose seems to be affected by outliers to a lesser extent.
Although the performance of both classifiers improves as the number
k of hidden units increases, the spatial median based network
achieves good results with a simpler architecture. As in the non
contaminated case, as few as three basis functions are enough to
obtain performance quite similar to that obtained with more hidden
units.



Table 2. Mean error rates and standard errors (in brackets) of RBF networks
with different centre estimation methods in the 2-class problem (10% of outlying
observations in Class 2).

 k     k-means RBF        k-medoids RBF      k-spatial medians RBF
 2     0.4054 (0.0524)    0.3467 (0.0769)    0.3821 (0.0535)
 3     0.2915 (0.1244)    0.1685 (0.0426)    0.1382 (0.0050)
 4     0.2019 (0.0834)    0.1538 (0.0205)    0.1322 (0.0060)
 5     0.1550 (0.0395)    0.1460 (0.0176)    0.1280 (0.0063)
 6     0.1367 (0.0187)    0.1426 (0.0153)    0.1275 (0.0058)
 7     0.1313 (0.0066)    0.1391 (0.0132)    0.1274 (0.0061)
 8     0.1272 (0.0058)    0.1351 (0.0106)    0.1266 (0.0065)
 9     0.1257 (0.0056)    0.1330 (0.0107)    0.1246 (0.0062)
 10    0.1238 (0.0057)    0.1303 (0.0106)    0.1248 (0.0064)



Table 3 and Table 4 report the results of another simulation
study, dealing with the "waveform data" (Breiman et al., 1984). It
is a 3-class problem in 21 dimensions. The classes have equal
probability. The test set consists of 3000 observations. Fifty training
sets of 600 observations each were generated. The misclassification
error rate of each classifier has been estimated by averaging
the 50 error rates on the test set.

In the contaminated setting (Table 4), 10% of the observations
in each class have been generated from a distribution which differs
from the original one only in the standard deviation (15 times the
original one).

Table 3. Mean error rates and standard errors (in brackets) of RBF networks with
different centre estimation methods in the 3-class problem (no outlying observation).

 k     k-means RBF        k-medoids RBF      k-spatial medians RBF
 2     0.3712 (0.0088)    0.3712 (0.0217)    0.3659 (0.0095)
 3     0.1637 (0.0107)    0.1828 (0.0134)    0.1649 (0.0113)
 4     0.1597 (0.0137)    0.1862 (0.0119)    0.1601 (0.0099)
 5     0.1470 (0.0087)    0.1798 (0.0110)    0.1490 (0.0120)
 7     0.1394 (0.0045)    0.1690 (0.0080)    0.1400 (0.0050)
 9     0.1397 (0.0040)    0.1634 (0.0065)    0.1393 (0.0038)
 11    0.1400 (0.0041)    0.1597 (0.0068)    0.1403 (0.0041)
 13    0.1406 (0.0047)    0.1579 (0.0059)    0.1401 (0.0048)
 15    0.1412 (0.0044)    0.1554 (0.0062)    0.1407 (0.0052)


Table 4. Mean error rates and standard errors (in brackets) of RBF networks with
different centre estimation methods in the 3-class problem (10% of outlying data).

 k     k-means RBF        k-medoids RBF      k-spatial medians RBF
 2     0.3767 (0.0108)    0.3841 (0.0144)    0.3726 (0.0080)
 3     0.2542 (0.0782)    0.1928 (0.0218)    0.1814 (0.0166)
 4     0.2190 (0.0744)    0.1861 (0.0197)    0.1779 (0.0206)
 5     0.1836 (0.0466)    0.1818 (0.0154)    0.1570 (0.0180)
 7     0.1703 (0.0271)    0.1796 (0.0145)    0.1431 (0.0055)
 9     0.1663 (0.0179)    0.1796 (0.0138)    0.1412 (0.0050)
 11    0.1678 (0.0168)    0.1787 (0.0130)    0.1402 (0.0047)
 13    0.1626 (0.0108)    0.1779 (0.0126)    0.1400 (0.0039)
 15    0.1584 (0.0142)    0.1785 (0.0115)    0.1406 (0.0042)


This simulation experiment confirms that in the non contaminated
setting the k-means based classifier and the spatial median
based one have similar performance, but the latter outperforms
the former in the presence of extreme observations. Due to the fact
that the RBF centres must be selected from the training set, the
k-medoid based classifier seems to be less flexible than the others.
However, if few hidden units are considered, the presence of
outliers makes it preferable to the k-means based classifier.

The simulation results therefore seem to confirm that, when
outlying data are present, the solution we propose improves the
network performance, also with regard to its stability, and makes it
possible to define a more parsimonious network.

Abstract. The exponential power function distributions are a natural generalization
of the normal distribution. A slight arrangement of their parameters originates
the normal distributions of order p. The two formulations are fairly equivalent, and 
we shall use the second one because many properties, related to scale and shape, 
for instance, may be written in a simpler way. The parameter estimation of such 
distributions, however, is rather difficult in practice as non-linear equation systems 
have to be solved. In addition, the estimates that may be defined are often either 
biased or have large variance. In this paper we introduce a genetic algorithm for 
estimating the location and scale parameters, in the presence of small sample size, 
when the shape parameter p is assumed known. Then, if the sample size is mod- 
erate, a procedure based on genetic algorithms for estimating simultaneously the 
location, scale and shape parameters will be developed. A simulation experiment is 
carried out to illustrate the advantages and the merits of the proposed procedure 
compared with some well established methods. 



1 Introduction 

Subbotin (1923), generalizing the second Schiaparelli axiom, in- 
troduced the class of distributions named exponential power func- 
tions (EPF) as a generalized random error structure. Several for- 
mulations of the EPF were extensively analyzed in the literature 
(Diananda, 1949; Box, 1953; Turner, 1960; Box and Tiao, 1973), 
but here we refer to the model known as normal distributions of 
order p (Vianelli, 1963; Lunetta, 1963) 

$$f(x; \mu, \sigma_p, p) = \{2 p^{1/p} \sigma_p \Gamma(1 + 1/p)\}^{-1} \exp\{-(p \sigma_p^p)^{-1} |x - \mu|^p\}, \qquad (1)$$

where $-\infty < x < +\infty$, $-\infty < \mu < +\infty$, $0 < \sigma_p < \infty$,
$0 < p < \infty$. The density (1) is completely specified by the location
$\mu$, the scale $\sigma_p$ and the shape parameter $p$. A family of unimodal
symmetric curves arises from the density function (1) for different
values of $p$, with shape varying from leptokurtic ($0 < p < 2$)
to platykurtic ($p > 2$) distributions. The Laplace ($p = 1$), the
Normal ($p = 2$), and the Uniform ($p \to \infty$) distributions may be
obtained as special cases. In Figure 1, some curves are plotted
for different values of $p$, with $\mu$ set to zero
and $\sigma_p$ set to unity.

Fig. 1. Example of particular cases for normal distributions of order p. The location
parameter is assumed $\mu = 0$, and the scale $\sigma_p = 1$.
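A short numerical check of the special cases of density (1) may be sketched as follows. The function name `order_p_normal_pdf` and the evaluation points are hypothetical and chosen only for illustration.

```python
import math

def order_p_normal_pdf(x, mu=0.0, sigma_p=1.0, p=2.0):
    """Density (1) of the normal distribution of order p."""
    norm_const = 2.0 * p ** (1.0 / p) * sigma_p * math.gamma(1.0 + 1.0 / p)
    return math.exp(-abs(x - mu) ** p / (p * sigma_p ** p)) / norm_const

x = 0.7
# p = 2 should reproduce the standard normal density.
print(order_p_normal_pdf(x, p=2.0), math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi))
# p = 1 should reproduce the Laplace density with unit scale.
print(order_p_normal_pdf(x, p=1.0), 0.5 * math.exp(-abs(x)))
# For large p the density at the centre approaches 1/2, the height of the
# uniform density on (-1, 1).
print(order_p_normal_pdf(0.0, p=200.0))
```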

Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ be an i.i.d. random sample drawn from
this probability density function. The maximum-likelihood (ML)
estimator of the parameter vector $\theta = (\mu, \sigma_p, p)$ is any value
$\hat{\theta}$ which maximizes the likelihood function

$$L(\theta, \mathbf{x}) = \{2 p^{1/p} \sigma_p \Gamma(1 + 1/p)\}^{-n} \exp\Big\{-(p \sigma_p^p)^{-1} \sum_{i=1}^{n} |x_i - \mu|^p\Big\}. \qquad (2)$$
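For numerical work it is usually more convenient to handle the logarithm of (2). The sketch below is illustrative only: the function names are hypothetical, and `ml_scale` uses the stationarity condition obtained by differentiating the log-likelihood with respect to $\sigma_p$ for fixed $\mu$ and $p$, which gives $\sigma_p^p = n^{-1} \sum_i |x_i - \mu|^p$.

```python
import math

def log_likelihood(x, mu, sigma_p, p):
    """Logarithm of the likelihood (2) for a sample x."""
    n = len(x)
    log_norm = (math.log(2.0) + math.log(p) / p + math.log(sigma_p)
                + math.lgamma(1.0 + 1.0 / p))
    s = sum(abs(xi - mu) ** p for xi in x)
    return -n * log_norm - s / (p * sigma_p ** p)

def ml_scale(x, mu, p):
    """Value of sigma_p maximising (2) when mu and p are held fixed."""
    return (sum(abs(xi - mu) ** p for xi in x) / len(x)) ** (1.0 / p)

sample = [1.2, 2.5, 3.1, 4.4, 3.8]       # made-up data, for illustration only
s_hat = ml_scale(sample, mu=3.0, p=1.5)
print(s_hat, log_likelihood(sample, 3.0, s_hat, 1.5))
```

For $p = 2$ the function reduces to the usual normal log-likelihood with standard deviation $\sigma_p$.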

Properties and performances of the parameter estimation methods
for the distributions in this class were extensively studied by
Mineo (1989), Mineo (1994), Agrò (1995), Burgio and Nikitin
(1998), Chiodi (2000). This paper proposes a genetic algorithm
(GA) for estimating $\theta$ based on the function $L(\theta, \mathbf{x})$. Such an
approach is believed to be effective in avoiding several difficulties that
occur when the common numerical search algorithms are used. Alternative
estimation procedures, which we do not consider here, have been
proposed that combine the ML and moment estimation methods
based on the kurtosis index. These methods were found able to give
good results also for small samples (Mineo, 1994).
Two scale parameter estimators (Chiodi, 1988; La Tona, 1994)
were introduced to reduce the bias of the ML scale estimator
when $p$ is supposed known. In fact, the usual degrees-of-freedom
correction for variance computation does not work for
$p \neq 2$, as the location parameter is replaced by its ML estimate.

Holland (1975) studied the evolution of a population of individuals
in a given environment by means of a class of analytical
models called GA. According to this approach, each individual
in the population, which need not be given a strictly statistical
meaning, is characterized by a chromosome, i.e. a
string of characters. Each character in the string possesses a specific
meaning that may be decoded from both its value, called
allele, and its position, called locus. The population evolves, through
different genetic operators and over several generations, towards the
best adaptation to the environment. A basic assumption is that
a function may be defined that maps the genetic pool (the set
of all admissible strings) into the positive real axis. This function
is known as the fitness function (ff). The ff is to be computed
for each individual, and it has to increase as the adaptation
to the environment increases. Applications of the GA to
function optimization take the objective function
as the ff, and the string owned by the fittest individual as
the solution to the optimization problem. The effectiveness of the
GA for optimization may be improved by using the so-called elitist
strategy, where the best fitness function value found so far is always
recorded, and possibly updated in each generation. This approach
is similar to the simulated annealing (SA) application to function
optimization (Brooks and Morgan, 1995). The use of GA for statistical
inference was investigated by Chatterjee and Laudatto (1997).

The outline of the paper is as follows. The next section is
devoted to the illustration of the GA we implemented for assessing
estimates and standard errors. A comparison with alternative
approaches by means of a simulation experiment is provided in
Section 3. The estimation of the location and scale parameters
in the case of small samples, for given order $p$, and the estimation of
all three parameters, for rather large sample sizes, are considered.
Conclusions are drawn in the last section.

2 Genetic Algorithms Implementation 

The main features of our GA design are tournament selection
for reproduction and three operators: crossover, inversion, and
mutation. Unlike in many GA applications, inversion proved to be
a valid means of increasing the speed of convergence to the
solution. The potential solutions, which are real numbers
in a given interval (a, b), say, were encoded by the following device.
Let us choose a positive integer $l$ as the binary string length.
Then $2^l$ different numbers $x$ may be coded according to

$$x = a + c(b - a)/(2^l - 1), \qquad (3)$$

where $c$ represents the integer value corresponding to the given binary
string. As $c$ may vary from 0 (the null string) through $2^l - 1$ (the
all-one string), (3) encodes a finite subset of real numbers
in (a, b), equispaced at intervals of length $(b - a)/(2^l - 1)$.
The code (3) was used for each of the three parameters to be
estimated. The integer $l$ was kept fixed, whilst the three intervals
(a, b) were selected according to the parameter constraints (e.g. $p$
has to be positive), some preliminary plots of artificial data and
suggestions from previous studies (see, for instance, Agrò (1995;
1999)).
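The decoding rule (3) may be sketched as follows. The representation of a chromosome as a list of 0/1 integers and the function name `decode` are assumptions made for illustration.

```python
def decode(bits, a, b):
    """Map a binary string (list of 0/1 values) to a real number via (3)."""
    l = len(bits)
    c = int("".join(str(bit) for bit in bits), 2)   # integer value of the string
    return a + c * (b - a) / (2 ** l - 1)

# The null string maps to a, the all-one string to b.
print(decode([0] * 32, 0.5, 5.0), decode([1] * 32, 0.5, 5.0))
```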

Reproduction. Let the so-called pressure selection probability
$p_s$ be chosen, which determines the rate at which individuals with
high fitness spread into the population. A set of $s$ potential
solutions is generated at random as binary strings of length $l$. We
chose $l = 32$ in our application. The $s$ individuals in the
population are paired at random, then each pair is examined. A copy
of the string of the individual with the greater fitness function
value replaces the string of the less fit
individual with probability $p_s$. So, there is a chance of $1 - p_s$ that
the two individuals remain unchanged. Note that, if $s$ is odd, one
individual is left alone and remains unchanged.
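The reproduction step may be sketched as below. This is a minimal illustration under stated assumptions: chromosomes are lists of 0/1 values, `fitness` is any user-supplied function of a chromosome, ties are resolved arbitrarily, and the names `reproduce` and `rng` are hypothetical.

```python
import random

def reproduce(population, fitness, p_s=0.75, rng=random):
    """Pairwise tournament: the fitter string overwrites the other with prob. p_s."""
    idx = list(range(len(population)))
    rng.shuffle(idx)                              # random pairing
    new_pop = [list(ind) for ind in population]
    # zip() drops the unpaired index when the population size is odd,
    # so that individual is carried over unchanged.
    for i, j in zip(idx[0::2], idx[1::2]):
        fi, fj = fitness(population[i]), fitness(population[j])
        winner, loser = (i, j) if fi > fj else (j, i)
        if rng.random() < p_s:
            new_pop[loser] = list(population[winner])
    return new_pop

pop = [[0, 0, 0, 0], [1, 1, 1, 1], [1, 0, 0, 0], [1, 1, 0, 0], [0, 1, 0, 1]]
print(reproduce(pop, fitness=sum, rng=random.Random(1)))
```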

Crossover. Though many crossover operators are available,
for problems of this kind the single-point crossover seemed the
best choice. The $s$ individuals, after the reproduction step, are
again paired at random, except one at most if $s$ is odd, which
remains unchanged as before. A positive integer $j$, say, in the
range $(1, l - 1)$ is selected at random. Then the characters
from the $(j + 1)$-th onwards are exchanged between the two individuals.
The crossover operator is applied to a pair with probability $p_c$.
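For a single pair, the operator may be sketched as follows. Chromosomes are assumed to be lists of 0/1 values, the random pairing is left out, and the function name is hypothetical.

```python
import random

def single_point_crossover(parent1, parent2, p_c=0.7, rng=random):
    """Exchange the characters from position j + 1 onwards with probability p_c."""
    if rng.random() >= p_c:
        return list(parent1), list(parent2)       # pair left unchanged
    l = len(parent1)
    j = rng.randint(1, l - 1)                     # cut point in 1, ..., l - 1
    child1 = list(parent1[:j]) + list(parent2[j:])
    child2 = list(parent2[:j]) + list(parent1[j:])
    return child1, child2

print(single_point_crossover([0] * 8, [1] * 8, p_c=1.0, rng=random.Random(2)))
```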

Inversion. Each of the $s$ individuals, after crossover has taken place,
is considered in turn, and its string is processed by the inversion
operator with probability $p_i$. Within the range $(1, l)$, two cut
points are selected at random, and the characters in between are
taken in reverse order.
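A possible reading of the inversion operator is sketched below. The text does not state whether the cut points themselves belong to the reversed segment, so treating them as slice boundaries is an assumption, as is the function name.

```python
import random

def inversion(chrom, p_i=0.65, rng=random):
    """Reverse the segment between two random cut points with probability p_i."""
    chrom = list(chrom)
    if rng.random() < p_i:
        # Two distinct boundaries among 0, ..., l; the slice chrom[i:j] is reversed.
        i, j = sorted(rng.sample(range(len(chrom) + 1), 2))
        chrom[i:j] = chrom[i:j][::-1]
    return chrom

print(inversion([1, 0, 0, 1, 1, 0, 1, 0], p_i=1.0, rng=random.Random(3)))
```

Inversion only rearranges the characters of a string, so the number of ones in the chromosome is preserved.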

Mutation. All characters in the whole genetic pool are allowed
to change their values with probability $p_m$. The choice of $p_m$ was
found to influence the performance of the GA considerably. A
high $p_m$ may serve to maintain the diversity between the individuals
in the population, but this is likely to produce premature
convergence as well.
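For binary strings, mutation amounts to flipping each bit independently. A minimal sketch, with a hypothetical function name and chromosomes assumed to be lists of 0/1 values:

```python
import random

def mutate(chrom, p_m=0.01, rng=random):
    """Flip each bit of the chromosome independently with probability p_m."""
    return [1 - bit if rng.random() < p_m else bit for bit in chrom]

print(mutate([0, 1, 1, 0, 1, 0, 0, 1], p_m=0.1, rng=random.Random(4)))
```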

For the present problem, most figures recommended by Chatterjee
and Laudatto (1997) were adopted, i.e. $p_s = 0.75$, $p_c = 0.7$,
$p_i = 0.65$, and 50 as the number of generations. We took
instead a rather small population size, $s = 50$, as no substantial
improvement was observed with larger values. Figures in the range
0.001 through 0.1 were suggested for $p_m$, but $p_m = 0.01$ seemed
the right balance for the present problem. We also took
the likelihood $L(\theta, \mathbf{x})$ in (2) as the ff. The vector estimate
$\hat{\theta}$ was decoded from the chromosome of the best individual found by the GA.



3 Simulation and Results 

Two different Monte Carlo simulation studies were designed, in which
the shape parameter $p$ was supposed known and unknown, respectively.
By means of a pseudo-random number algorithm (Chiodi,
1986), 100 samples of size $n = 10, 20, 50$, for $p$ supposed
known, and of size $n = 50, 100$, for $p$ unknown,
were generated from a standardized normal distribution of order
$p$, with $p = 1, 1.25, 1.5, 1.75, 2, 2.25, 2.75$ and $3.25$. Each standardized
value was then transformed so as to follow
a normal distribution of order $p$ with $\theta = (3, 2, p)$. We
considered larger sample sizes in the $p$ unknown case because of
the difficulty of reaching the maximum of the function (2) for sample
sizes $n < 100$ and for some estimated values of $p$ (Agrò, 1995). For
$p$ known, the vector $\theta$ to be estimated is composed of $\mu$ and
$\sigma_p$. We also underline that, even if the hypothesized situation of "p
known" is an unrealistic one, we took it into account because many
results, concerning, for instance, approximate sampling distributions
of tests (Lunetta, 1966; Chiodi, 1994 and 2000), are still mainly
based on this assumption.
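One way of generating such samples is sketched below. It is not the algorithm of Chiodi (1986), whose details are not given here; it relies instead on the fact that, for a standardized order-$p$ normal variable $Z$, the quantity $|Z|^p / p$ follows a Gamma distribution with shape $1/p$ and unit scale. The function name and the seed are hypothetical.

```python
import numpy as np

def sample_order_p_normal(n, mu, sigma_p, p, rng=None):
    """Draw n values from the normal distribution of order p (gamma method)."""
    rng = np.random.default_rng() if rng is None else rng
    g = rng.gamma(shape=1.0 / p, scale=1.0, size=n)   # |Z|^p / p ~ Gamma(1/p, 1)
    signs = rng.choice([-1.0, 1.0], size=n)           # symmetric about zero
    z = signs * (p * g) ** (1.0 / p)                  # standardized value
    return mu + sigma_p * z                           # location-scale transform

rng = np.random.default_rng(0)
x = sample_order_p_normal(50, mu=3.0, sigma_p=2.0, p=1.5, rng=rng)
print(x.mean(), x.std())
```

For $p = 2$ the draws are ordinary normal variates with mean $\mu$ and standard deviation $\sigma_p$, which gives a simple check of the construction.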

As far as the first simulation experiment is concerned, four
algorithms were compared. The first one is the ML procedure
proposed by Chiodi (1988). The second one is essentially the jackknife
estimator described in La Tona (1994). The third is the GA.
The fourth algorithm is a SA procedure which is aimed at
providing several starting points for a quasi-Newton algorithm.
It was designed as a numerical optimization tool for maximizing
the likelihood function. The results are displayed in Table 1. The
sample means and the standard errors of the estimated vector
$\hat{\theta}$ were computed for each $p$ value and sample size by using the
procedures listed above. The least absolute bias, averaged over
all the $(p, n)$ pairs, for the location parameter $\mu$, is obtained by
the ML algorithm. The GA minimizes instead the average absolute
bias for the scale parameter $\sigma_p$. It also seems interesting to
distinguish the results according to the sample size. For $n = 10$,
the least absolute bias, averaged over all values of the shape
parameter $p$, is obtained by the GA estimates of both $\mu$
and $\sigma_p$. Nonetheless, the least standard errors, on average,
are obtained, for $\mu$, by the jackknife estimator, and, for $\sigma_p$,
by the ML one. As far as the sample size $n = 20$ is concerned, the
least average absolute bias and standard errors are obtained
by using the jackknife estimator, for $\mu$, and by the GA, for $\sigma_p$. In
the case of sample size $n = 50$, for both location and scale parameters,
the average absolute bias is minimized by the ML estimator,
and the standard error of the estimates by the GA. Another
comparison of interest distinguishes the values
of the shape parameter $p$ in two sets, the first one including the
figures from 1 through 2, the second one those from 2.25 through
3.25. In the latter case, the GA was able to give, in terms of
average absolute bias, the best results for both parameters.
In the former, the ML algorithm provided the best results for $\mu$,
and the GA for $\sigma_p$.

The results of the simulations, for $p$ unknown, are given in Table 2,
where the sample means and the standard errors of the estimated
vector $\hat{\theta}$ were computed for each $p$ value and sample size,
by using the ML and the GA approaches. The number $K$ of samples
for which the ML procedure did not reach any optimum
point of (2) is also reported in the last column. For all sample
sizes and values of $p$, the sample means of $\mu$ and $\sigma_p$ are very close to
their true values defined in $\theta$ for each approach. More precisely,
the results obtained by the GA approach are more accurate than
those of the ML approach when $p > 2$, except in some rare cases. The same
conclusion does not always hold for $p < 2$. As for the shape
parameter $p$, its true values were overestimated by its sample mean
for both approaches, even if the results of the GA approach were
always less biased than those of the ML approach for $p$ greater than 2.
The results for $p < 2$, however, have to be interpreted with caution,
because of the samples discarded by the ML procedure. No
substantial differences in the standard errors of the estimates
were found. Furthermore, we tried a SA procedure on the same
problem ($p$ unknown), and similar results were obtained.



4 Conclusions 



Some procedures for the estimation of the parameters of an or- 
der p normal distribution are compared by means of two distinct 
Monte Carlo simulations. The numerical inspection of the sample 

Table 1. Order-p normal distribution parameter estimates, p known

              Max. likeli.        Jackknife           Genetic alg.        Sim. anneal.
p     n       μ         σ_p       μ         σ_p       μ         σ_p       μ         σ_p
1     10      2.9917    1.8810    2.9883    2.1404    3.0100    1.9800    2.9641    2.3512
              (.6917)   (.5415)   (.6011)   (.6393)   (.7275)   (.5700)   (.7062)   (.6769)
      20      2.9819    1.9523    2.9819    2.0684    2.9815    2.0116    2.9562    2.1792
              (.4879)   (.3587)   (.4879)   (.3777)   (.5410)   (.3492)   (.5342)   (.3783)
      50      3.0265    1.9948    3.0265    2.0426    3.0047    2.0157    3.0027    2.0787
              (.2841)   (.2475)   (.2841)   (.2665)   (.2937)   (.3036)   (.3021)   (.3131)
1.25  10      3.0241    2.0346    3.1013    2.0271    2.9859    1.9673    2.9860    2.1479
              (.7358)   (.5960)   (.6526)   (.5932)   (.7049)   (.5429)   (.7049)   (.5927)
      20      3.0292    2.0314    3.0283    2.0361    3.0434    1.9615    3.0430    2.0442
              (.5014)   (.4402)   (.4983)   (.4421)   (.5988)   (.4391)   (.5989)   (.4576)
      50      3.0088    2.0031    3.0100    2.0048    3.0350    2.0079    3.0356    2.0402
              (.3397)   (.2641)   (.3391)   (.2631)   (.2533)   (.2473)   (.2535)   (.2513)
1.5   10      3.0684    1.9882    3.0709    1.9907    3.0824    1.9685    3.0823    2.0559
              (.5888)   (.5845)   (.5906)   (.5860)   (.6297)   (.5474)   (.6296)   (.5717)
      20      3.0068    2.0271    3.0071    2.0277    3.0408    2.0082    3.0407    2.0498
              (.4896)   (.3651)   (.4906)   (.3675)   (.5124)   (.3879)   (.5124)   (.3960)
      50      2.9763    1.9748    2.9764    1.9750    3.0307    1.9605    3.0306    1.9761
              (.3028)   (.2428)   (.3024)   (.2423)   (.3195)   (.2501)   (.3194)   (.2513)
1.75  10      3.0205    1.9401    2.9829    1.9340    2.9185    1.9898    2.9185    2.0240
              (.6755)   (.4805)   (.5856)   (.4726)   (.6552)   (.4784)   (.6552)   (.4866)
      20      3.0358    1.9979    3.0360    1.9987    2.9739    1.9959    2.9737    2.0120
              (.4426)   (.3580)   (.4426)   (.3583)   (.4552)   (.3739)   (.4550)   (.3769)
      50      3.0420    2.0067    3.0421    2.0069    2.9734    1.9676    2.9735    1.9738
              (.3076)   (.2540)   (.3076)   (.2543)   (.2836)   (.1904)   (.2837)   (.1910)
2     10      2.9352    1.9226    2.9352    1.9226    2.9203    1.9471    2.9202    1.9471
              (.6260)   (.4788)   (.6260)   (.4788)   (.6543)   (.4747)   (.6540)   (.4747)
      20      2.9869    1.9429    2.9869    1.9429    2.9802    1.9492    2.9801    1.9492
              (.4435)   (.3453)   (.4435)   (.3453)   (.4629)   (.3158)   (.4631)   (.3158)
      50      3.0014    1.9987    3.0014    1.9987    2.9671    2.0206    2.9668    2.0206
              (.2622)   (.2254)   (.2622)   (.2254)   (.2932)   (.1909)   (.2932)   (.1909)
2.25  10      3.1134    1.9122    3.1134    1.9102    2.9456    1.9938    2.9455    1.9707
              (.6496)   (.4488)   (.6496)   (.4476)   (.6050)   (.4935)   (.6051)   (.4878)
      20      3.0028    1.9623    3.0029    1.9620    2.9994    2.0074    2.9994    1.9964
              (.3995)   (.3057)   (.3996)   (.3056)   (.3601)   (.3131)   (.3601)   (.3114)
      50      2.9908    1.9774    2.9908    1.9773    2.9741    1.9893    2.9742    1.9850
              (.2841)   (.1738)   (.2841)   (.1739)   (.2933)   (.2063)   (.2929)   (.2059)
2.75  10      3.0725    1.9358    3.0842    1.9309    3.0003    1.8633    3.0003    1.8149
              (.5466)   (.3779)   (.5373)   (.3791)   (.5563)   (.3997)   (.5563)   (.3893)
      20      3.0437    1.9551    3.0437    1.9552    3.0552    1.9751    3.0552    1.9507
              (.4195)   (.3009)   (.4195)   (.2996)   (.3797)   (.2765)   (.3797)   (.2731)
      50      2.9726    2.0271    2.9726    2.0273    3.0183    2.0233    3.0180    2.0136
              (.2231)   (.1853)   (.2231)   (.1853)   (.2255)   (.1646)   (.2256)   (.1638)
3.25  10      2.9288    2.0104    2.9278    1.9990    3.0621    1.9547    3.0622    1.8874
              (.4649)   (.3823)   (.4655)   (.3816)   (.5005)   (.4018)   (.5004)   (.3879)
      20      3.0110    1.8980    3.0112    1.8977    2.9307    1.9773    2.9305    1.9450
              (.3331)   (.2656)   (.3331)   (.2647)   (.3920)   (.2538)   (.3921)   (.2497)
      50      3.0093    1.9245    3.0093    1.9239    2.9443    1.9957    2.9442    1.9830
              (.2071)   (.1580)   (.2071)   (.1581)   (.2079)   (.1732)   (.2080)   (.1721)

Table 2. Order-p normal distribution parameter estimates, all parameters unknown

                 Maximum likelihood               Genetic algorithm
p      n         μ         σ_p       p            μ         σ_p       p            K
1      50        3.0193    2.1429    1.2462       3.0267    2.1341    1.2250       10
                 (.2868)   (.3198)   (.3390)      (.2796)   (.3171)   (.3279)
       100       2.9781    2.0606    1.1144       3.0074    2.0403    1.1099       5
                 (.2004)   (.1917)   (.1602)      (.2169)   (.1988)   (.1762)
1.25   50        2.9765    2.0486    1.5072       3.0235    2.0528    1.4813       5
                 (.3578)   (.3436)   (.7016)      (.3176)   (.3237)   (.6211)
       100       3.0447    2.0468    1.3691       3.0066    2.0593    1.3774       6
                 (.2181)   (.2495)   (.2954)      (.2063)   (.2820)   (.3766)
1.50   50        2.9694    1.9805    1.7014       2.9812    2.0096    1.8719       2
                 (.3106)   (.3483)   (.7612)      (.3167)   (.3783)   (.9849)
       100       2.9924    1.9949    1.5935       2.9922    2.0094    1.5980       1
                 (.1959)   (.2634)   (.3806)      (.2009)   (.2324)   (.4399)
1.75   50        3.0454    2.1079    2.2741       2.9633    2.0760    2.1561       2
                 (.3220)   (.4431)   (1.0201)     (.2962)   (.3908)   (.9725)
       100       2.9570    1.9861    1.8479       2.9894    1.9606    1.8190
                 (.1987)   (.2150)   (.3964)      (.2111)   (.2353)   (.4464)
2      50        3.0064    2.0554    2.5580       3.0510    2.0858    2.5986       1
                 (.2908)   (.3666)   (1.1405)     (.2963)   (.3090)   (1.0214)
       100       2.9938    2.0654    2.2843       3.0127    2.024     2.2229
                 (.2196)   (.2362)   (.7219)      (.1815)   (.2246)   (.6650)
2.25   50        3.0160    2.0807    3.0267       2.9586    2.0637    2.8037
                 (.2985)   (.3049)   (1.1726)     (.2848)   (.3257)   (1.1711)
       100       3.0314    1.9904    2.4970       2.9944    2.0054    2.4778
                 (.2077)   (.2345)   (.6559)      (.1883)   (.2125)   (.7259)
2.5    50        2.9395    2.0682    3.2349       2.9985    1.9875    3.0510
                 (.2562)   (.2919)   (1.1717)     (.2670)   (.3054)   (1.2151)
       100       3.0771    2.0807    2.9473       3.0098    2.0395    2.7787
                 (.1718)   (.2123)   (.8330)      (.1661)   (.2171)   (.7440)
2.75   50        2.9752    2.0459    3.2150       3.0213    1.9974    3.2069
                 (.2416)   (.3026)   (1.1268)     (.2671)   (.3146)   (1.1257)
       100       2.9778    2.0149    3.0505       2.9904    1.9498    2.8491
                 (.1846)   (.2213)   (.8972)      (.1846)   (.1966)   (.7361)
3      50        2.9970    2.0584    3.6761       3.0009    2.0152    3.5714
                 (.2448)   (.2661)   (1.1710)     (.2509)   (.3022)   (1.2258)
       100       3.0188    2.0110    3.4036       3.0051    2.0365    3.3715
                 (.1733)   (.2124)   (.9240)      (.1624)   (.1970)   (.8667)
3.25   50        3.0054    1.9841    4.0412       3.0103    1.9782    3.7364
                 (.2199)   (.2248)   (1.0605)     (.2260)   (.2782)   (1.2226)
       100       2.9912    2.0242    3.6140       3.0063    2.0174    3.5910
                 (.1609)   (.1932)   (.9203)      (.1639)   (.2161)   (.9309)

mean and standard errors of the parameter estimates was made
by drawing 100 samples of size $n = 10, 20, 50$, when $p$ was
supposed known, and $n = 50, 100$, when $p$ was unknown, from a
normal distribution of order $p$ ($\mu = 3$; $\sigma_p = 2$) with $p = 1, 1.25, 1.5,
1.75, 2, 2.25, 2.75$ and $3.25$. When $p$ was assumed known, the ML,
jackknife, GA and SA estimators were considered for estimating
the location and scale parameters $\mu$ and $\sigma_p$. Though no clear-cut
results were obtained, in most cases the GA seemed able to ensure
better performance than the other procedures with respect
to both the bias and the standard errors of the estimates. For the
$p$ unknown case, the GA yields estimates of $p$ less biased than
those of the ML approach when $p > 2$. Opposite results occur if $p < 2$.
In this case, however, some samples were discarded by the ML
procedure. Similar findings were obtained, except in some particular
cases, for the estimates of the other two parameters.

Abstract. The paper presents an approach to modelling and forecasting the Industrial
Production Index which accounts for the presence of asymmetric
effects in both the conditional mean and the conditional variance. More precisely,
the proposed approach combines a Self Exciting Threshold AutoRegressive (SETAR)
model for the conditional mean with a conditional heteroskedastic model
fitted to the residuals. Namely, we use a Constrained Changing Parameters Volatility
(CPV-C) model, which captures asymmetries in the conditional variance
dynamics by means of interaction terms between past shocks and volatilities. The
out of sample fitting performance of the model is evaluated by means of an
application to a time series of U.S. data.



1 Introduction 

In recent years, there has been great interest in modelling
non-linearity in economic time series (Potter (1995); Clements and
Krolzig (1998)). In the Industrial Production Index, as in a wide
range of economic time series, the main source of non-linearity
is the asymmetric nature of the business cycle. Furthermore,
evidence of conditional heteroskedasticity in the Industrial
Production Index series has been found by some authors (e.g. Byers
and Peel (1995)). The approach we propose in this paper allows
asymmetric effects to be present not only in the level but also in
the conditional variance of the series. The main intuition behind
this is that a negative shock, corresponding to a decrease in the
industrial production level, is expected to have an effect on the
volatility of the series which is greater than that due to a positive
shock of the same magnitude, corresponding to an increase in the
industrial production level. In particular we combine a threshold

* This paper was supported by the MURST COFIN 2000 project: “Modelli 
Stocastici e Metodi di Simulazione per l’Analisi di Dati Dipendenti” . 







Amendola and Storti 



model for the conditional mean with a Constrained Changing Parameters
Volatility (CPV-C) model, recently proposed by Storti
(1999). The out-of-sample fitting performance of the model is assessed
by means of some well-known measures of goodness of fit.
The results are compared to those obtained by a model including
a simple linear autoregressive component for the conditional
mean together with a CPV-C model for the conditional variance
of the series. The paper is organised as follows: section 2 describes
the modelling strategy; the results of an application to the U.S.
seasonally adjusted Industrial Production Index series are shown in
section 3. Some concluding remarks are given in section 4.

2 Modelling Asymmetry 

2.1 Asymmetry in the TAR model 

In many economic time series the non-linearity is typically due
to the presence of different behaviours in negative and positive
growth phases. Therefore, the response of the output to shocks
occurring at different stages of the business cycle is asymmetric.
This asymmetry in the level can be captured by a threshold
model in which the different states are represented by the different
regimes. In particular we consider the class of Threshold AutoRegressive
(TAR) models first presented by Tong (1978) and further
developed and applied in Tong and Lim (1980) (a more thorough
discussion can be found in Tong (1990)). A TAR model can be regarded
as a piecewise linear autoregressive structure, which makes it possible
to decompose a complex stochastic system into
smaller subsystems, based on the values assumed by a threshold
variable compared with a set of predetermined threshold values.
Let $\{Y_t\}$ be an observed time series. A TAR model for $Y_t$ is given by:

$$Y_t = a_0^{(j)} + \sum_{i=1}^{p} a_i^{(j)} Y_{t-i} + u_t^{(j)}, \qquad X_{t-d} \in R_j \qquad (1)$$

where the errors $u_t^{(j)}$ are i.i.d. with zero mean and finite variance, the
threshold values $\{r_0, r_1, r_2, \ldots, r_s\}$ are such that $r_0 < r_1 < \ldots < r_s$,
$r_0 = -\infty$ and $r_s = +\infty$, with $R_j = (r_{j-1}, r_j]$ for $j = 1, \ldots, s$, and $d$ is a
positive integer. If we choose as a threshold variable a past realisation of $Y_t$,
$\{Y_{t-d}\}$, the time series $\{Y_t\}$ follows a Self Exciting Threshold
AutoRegressive (SETAR) model. To investigate the appropriateness
of a threshold non-linear model instead of a simpler linear model,
we perform the Tsay test for detecting threshold non-linearity
presented in Tsay (1989) and subsequently refined and generalised by
Tsay (1998). Under the null hypothesis of a linear model, Tsay’s
test statistic is asymptotically distributed as a chi-squared random
variable with (p+1) degrees of freedom.
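
To make the piecewise linear structure of model (1) more concrete, the following minimal Python sketch assigns a value of the threshold variable to its regime and simulates a SETAR path. The function names, the Gaussian errors, the zero start-up values and all numerical values are illustrative assumptions; they are not taken from the application discussed in this paper.

```python
import bisect
import random


def regime_index(value, thresholds):
    """Return the 1-based regime j such that value lies in (r_{j-1}, r_j].

    `thresholds` lists only the finite thresholds r_1 < ... < r_{s-1};
    r_0 = -inf and r_s = +inf are implicit.
    """
    return bisect.bisect_left(thresholds, value) + 1


def simulate_setar(n, coeffs, thresholds, d, sigma, seed=0):
    """Simulate n observations from a SETAR model with Gaussian errors.

    coeffs[j - 1] = [a_0, a_1, ..., a_p] holds the coefficients of regime j
    (hypothetical values); the threshold variable is Y_{t-d}.
    """
    rng = random.Random(seed)
    p = len(coeffs[0]) - 1
    burn = max(p, d)
    y = [0.0] * burn  # start-up values, assumed to be zero
    regimes = []
    for t in range(burn, burn + n):
        j = regime_index(y[t - d], thresholds)
        a = coeffs[j - 1]
        mean = a[0] + sum(a[i] * y[t - i] for i in range(1, p + 1))
        y.append(mean + rng.gauss(0.0, sigma))
        regimes.append(j)
    return y[burn:], regimes


# Illustrative three-regime example; all numbers are made up.
series, regimes = simulate_setar(
    n=300,
    coeffs=[[0.0, 0.5], [0.001, 0.35], [0.006, 0.1]],
    thresholds=[-0.002, 0.006],
    d=4,
    sigma=0.005,
)
print(len(series), sorted(set(regimes)))
```

Each regime has its own intercept and autoregressive coefficients, and the regime in force at time t is selected only by the position of the lagged value with respect to the thresholds.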



2.2 Asymmetric effects in the CPV model 

The Constrained Changing Parameters Volatility (CPV-C) model,
discussed in Storti (1999) and Amendola and Storti (2002), makes it possible
to model asymmetries in the conditional variance by considering
interactions between past shocks and volatilities. Let $\{u_t\}$ be an
observed time series such that $(u_t \mid I^{t-1}) \sim (0, h_t^2)$, with $h_t^2 < \infty$ for all $t$,
where $I^{t-1}$ indicates the set of information available at time $(t-1)$.
A CPV-C model of order $(m, n)$ is defined as:

$$u_t = \sum_{i=1}^{m} a_{i,t} u_{t-i} + \sum_{j=1}^{n} b_{j,t} h_{t-j} + e_t \qquad (2)$$

where $a_{i,t}$ and $b_{j,t}$ are sequences of i.i.d. random variables and,
furthermore, given $\theta_t = [a_{1,t}, \ldots, a_{m,t}, b_{1,t}, \ldots, b_{n,t}]'$,

$$E(\theta_t) = E(\theta_t \mid I^{t-1}) = 0, \qquad Var(\theta_t) = Var(\theta_t \mid I^{t-1}) = Q$$

for $i = 1, \ldots, m$; $j = 1, \ldots, n$ and $t = 1, \ldots, T$. Also, let $e_t \sim N(0, \omega)$ be
a Gaussian white noise error independent of $\theta_t$. More details on
the relationship between random coefficient autoregressive models
and conditional heteroskedastic models can be found in Tsay
(1987). For a CPV-C model of order $(m, n)$, the conditional variance
of $u_t$ given past information $I^{t-1}$ is given by:

$$h_t^2 = \omega + \sum_{j=1}^{m} \alpha_j u_{t-j}^2 + \sum_{j=1}^{n} \beta_j h_{t-j}^2 + 2 \sum_{i=1}^{m} \sum_{j=1}^{n} \gamma_{i,j} u_{t-i} h_{t-j} + 2 \sum_{i \neq j} \delta_{i,j} u_{t-i} u_{t-j} + 2 \sum_{i \neq j} \eta_{i,j} h_{t-i} h_{t-j} \qquad (3)$$

where $\{\omega, \alpha_j, \beta_j, \gamma_{i,j}, \delta_{i,j}, \eta_{i,j}\}$, for $i = 1, \ldots, m$ and $j = 1, \ldots, n$,
are unknown parameters which can be estimated by maximum
likelihood. Our experience suggests that the economic and financial
data currently encountered in practical applications do not usually
require model orders $m, n > 2$ and that the simple CPV-C(1,1)
offers a flexible framework for the estimation of $h_t^2$ in a wide
range of situations. In this case the equation for the conditional
variance simplifies to:

$$h_t^2 = \omega + \alpha_1 u_{t-1}^2 + \beta_1 h_{t-1}^2 + 2 \gamma_{1,1} u_{t-1} h_{t-1} \qquad (4)$$

where the asymmetry in the relation between the conditional variance
and past shocks comes from the interaction term $2 \gamma_{1,1} u_{t-1} h_{t-1}$.
In particular, for $\gamma_{1,1} < 0$, a positive penalty term will be added
to $h_t^2$ if $u_{t-1} < 0$, while a positive quantity will be subtracted from
$h_t^2$ if $u_{t-1} > 0$. It is also worth noting that the magnitude of the
penalty term added to the conditional variance does not depend
only on the magnitude of the shock itself: the shock is weighted or,
more properly, rescaled by the conditional standard deviation at
time $(t-1)$. So the greater the uncertainty associated with
the negative shock $u_{t-1}$, the greater the impact of $u_{t-1}$ on $h_t^2$.
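
The asymmetric response described for equation (4) can be checked numerically with a short Python sketch. The function name and the parameter values below are hypothetical and chosen only for illustration; they are not the estimates reported later in the paper.

```python
import math


def cpvc11_variance(u_prev, h2_prev, omega, alpha, beta, gamma):
    """Conditional variance h_t^2 of a CPV-C(1,1) model, as in equation (4).

    u_prev  : past shock u_{t-1}
    h2_prev : past conditional variance h_{t-1}^2 (must be non-negative)
    """
    h_prev = math.sqrt(h2_prev)
    return (omega + alpha * u_prev ** 2 + beta * h2_prev
            + 2.0 * gamma * u_prev * h_prev)


# Hypothetical parameter values with gamma < 0 (not estimates from the paper).
params = dict(omega=1e-5, alpha=0.25, beta=0.60, gamma=-0.15)
h2_prev = 1e-4   # past conditional variance, i.e. h_{t-1} = 0.01
shock = 0.01

h2_after_negative = cpvc11_variance(-shock, h2_prev, **params)
h2_after_positive = cpvc11_variance(+shock, h2_prev, **params)

# The two responses differ by 4 * |gamma| * |u_{t-1}| * h_{t-1}.
print(h2_after_negative > h2_after_positive)
print(h2_after_negative - h2_after_positive)
```

With a negative interaction coefficient, a negative shock raises the conditional variance more than a positive shock of the same size, and the difference grows with the past conditional standard deviation.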

A similar reasoning applies to the case in which $\gamma_{1,1} > 0$. A generalization
which allows for simultaneous modelling of the conditional
mean and variance is given by the regression CPV-C model



$$y_t = M_t \beta + u_t \qquad (5)$$

where $M_t$ is a $(1 \times q)$ vector of regressors, $\beta$ a $(q \times 1)$ vector of
unknown parameters and $u_t$ follows a CPV-C model. The formulation
of model (5) is quite general and the regression term included
in the observation equation can incorporate an ARMA
type structure with exogenous explanatory variables. Also, replacing
the constant parameter vector $\beta$ by a time-varying vector $\beta_t$,
we can easily accommodate some common non-linear structures
such as Threshold Autoregressive models. The model parameters
can be estimated by maximizing a Gaussian log-likelihood
function expressed in the classical prediction error decomposition
form. Under the above assumptions, the log-likelihood function of
a regression CPV-C model can be written as

$$\ell(y; \beta_t, \omega, Q) = -\frac{T}{2} \log(2\pi) - \frac{1}{2} \sum_{t=1}^{T} \log[(h_t)^2] - \frac{1}{2} \sum_{t=1}^{T} \left( \frac{e_{t|t-1}}{h_t} \right)^2 \qquad (6)$$

where $e_{t|t-1} = y_t - \hat{y}_{t|t-1}$ is the one-step-ahead prediction error. The unknown
parameters in $\ell(y; \beta_t, \omega, Q)$
can be estimated by maximising the Gaussian log-likelihood (6)
using a version of the EM algorithm tailored for state space mod- 
els by Wu et al. (1996). Alternatively scoring or quasi-Newton 
methods could also be used (see Watson and Engle (1983), for a 
discussion on the application of the method of scoring in the con- 
text of state space models). The asymptotic standard errors can 
be obtained by analytical evaluation of the Observed Information 
matrix. More details can be found in Storti and Vitale (2002). 
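
As an illustration of the prediction error decomposition form in (6), the sketch below evaluates the Gaussian log-likelihood for given prediction errors and conditional variances, and applies it to an AR(1) mean with a CPV-C(1,1) variance. It only evaluates the likelihood by direct recursion; it is not the EM algorithm mentioned above. The function names, the start-up value for the variance (the sample mean of the squared residuals) and the numbers in the example are assumptions made for this sketch.

```python
import math


def gaussian_loglik(pred_errors, cond_variances):
    """Gaussian log-likelihood in prediction error decomposition form.

    pred_errors    : one-step-ahead prediction errors e_{t|t-1}
    cond_variances : conditional variances h_t^2 (all strictly positive)
    """
    if len(pred_errors) != len(cond_variances):
        raise ValueError("inputs must have the same length")
    T = len(pred_errors)
    ll = -0.5 * T * math.log(2.0 * math.pi)
    for e, h2 in zip(pred_errors, cond_variances):
        ll -= 0.5 * (math.log(h2) + e * e / h2)
    return ll


def ar1_cpvc11_loglik(y, a0, a1, omega, alpha, beta, gamma):
    """Log-likelihood of an AR(1) mean with a CPV-C(1,1) variance.

    The first conditional variance is set to the mean of the squared
    residuals; this start-up rule is an assumption of the sketch.
    """
    u = [y[t] - a0 - a1 * y[t - 1] for t in range(1, len(y))]
    h2 = sum(x * x for x in u) / len(u)
    variances = []
    for t in range(len(u)):
        if t > 0:
            h2 = (omega + alpha * u[t - 1] ** 2 + beta * h2
                  + 2.0 * gamma * u[t - 1] * math.sqrt(h2))
        variances.append(h2)
    return gaussian_loglik(u, variances)


# Made-up data and parameter values, for illustration only.
y_demo = [0.0, 0.01, -0.02, 0.015, 0.0, 0.005]
ll_demo = ar1_cpvc11_loglik(y_demo, a0=0.0, a1=0.3,
                            omega=1e-5, alpha=0.25, beta=0.60, gamma=-0.15)
print(ll_demo)
```

A function of this kind could be passed to a general-purpose numerical optimiser as an alternative to the EM or scoring approaches; the recursion requires the conditional variances to stay strictly positive.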



3 The U.S. Industrial Production Index 

The dataset we analyze is the monthly seasonally adjusted
U.S. Industrial Production Index, from January
1940 to December 2000, for a total of 732 observations. The first
696 observations have been used for model identification and estimation
while the last 36 have been used to assess the out-of-sample
fitting performance. The analysis is performed on the growth rate
series calculated as the logarithmic first difference $Y_t = \nabla(\log X_t)$.
The time plots of the original and transformed series are
reproduced in Figure 1. As a first step, on the basis of the estimated
autocorrelation function, we identify and estimate a linear
autoregressive model of order 1 with a CPV-C(1,1) structure for
the conditional variance¹. This will be used as a benchmark model
for our analysis. The estimated parameters with their respective
standard errors are reported in Table 1. An important point is that
the estimate obtained for the $\gamma_{1,1}$ parameter is negative and
significantly different from 0, showing evidence of asymmetry in the
conditional variance dynamics.

Table 1. Parameter estimates and asymptotic s.e. (in parentheses) of the AR-CPV model.

            AR(1)                   CPV-C(1,1)
            a_0        a_1          ω             α_1        γ_{1,1}    β_1
Estimate    0.0034     0.3916       2.4189e-5     0.2892     -0.1888    0.5371
(s.e.)      (0.0008)   (0.0348)     (4.2081e-6)   (0.0542)   (0.0310)   (0.0581)

In order to investigate the presence of threshold non-linearity we use
Tsay’s linearity test (see Tsay (1998)). The results shown in Table 2
lead us to reject the null hypothesis of linearity. Furthermore, the
test suggests using a delay d = 4.



Table 2. Results of Tsay’s threshold non-linearity test.

d       1       2       3       4       5       6       7
Test    2.097   17.91   25.13   29.90   6.935   10.25   16.93
d.f.    2       2       2       2       2       2       2



1 The decision to include a CPV-C conditional heteroskedastic component is based
on the results of an ARCH-LM test performed on the residuals obtained from
fitting an autoregressive model of order 1 to the data. The value of the test
statistic leads to a clear rejection of the null hypothesis of a simple linear model
in favour of the alternative of conditional heteroskedasticity. These results are
available upon request.

Fig. 1. Original data (a) $X_t$ and first difference of log-transformed data (b) $Y_t = \nabla(\log X_t)$,
for the U.S. seasonally adjusted (SA) IPI series from January 1940 to
December 2000.

Setting the AR order p to lie in the range [1, 2], given d = 4 and
the number of regimes s ∈ {2, 3}, among all the possible combinations
of p, d and s we have then chosen the model characterised
by the minimum Akaike Information Criterion (AIC) and selected
the threshold values $r_1 = -0.0018$ and $r_2 = 0.0065$ giving the minimum
AIC. The least squares estimates of the parameters of the
final model, a SETAR(3;1,1,1), and the number of observations
in each regime are given in Table 3. The results of a Lagrange
Multiplier test for AutoRegressive Conditional Heteroskedasticity
(ARCH-LM; Engle (1982)) performed on the residuals of the estimated
SETAR model (Table 4) and the analysis of the autocorrelation
function of the squared residuals suggest the presence of
conditional heteroskedasticity. Hence, we estimate a CPV-C(1,1)
model for the conditional variance. The maximum likelihood estimates
of the model parameters and their asymptotic standard
errors are reported in Table 5. Again the negative sign of $\gamma_{1,1}$
Table 3. Least squares estimates and s.e. (in parentheses) for the SETAR-CPV(3;1,1,1) model.

Regime   n.obs.   a_0                a_1
I        186                         0.4967 (0.0678)
II       270      0.0009 (0.0005)    0.3678 (0.0481)
III      235      0.0064 (0.0009)    0.1263 (0.0656)



Table 4. Results of the ARCH-LM test up to lag 3.

ARCH-LM Test
Test statistic   105.5662
p-value          0.0000



shows evidence of asymmetry in the conditional variance of the
series. This is confirmed by the news impact curve associated with
the estimated model and reported in Figure 2.

The out-of-sample fitting performance of the SETAR-CPV model 



Table 5. ML parameter estimates and asymptotic s.e. (in parentheses) for the
estimated CPV-C(1,1) model.

ω             α_1        γ_{1,1}    β_1
1.0158e-5     0.2354     -0.1401    0.7109
(2.1726e-6)   (0.0420)   (0.0247)   (0.0368)



is assessed by means of some widely used measures of goodness 
of fit (MSE, MAE, MSPE, RMSPE, MAPE) and the results are 
compared with those obtained by the AR-CPV model (Table 6). 
On the basis of all the indices considered, the one-step-ahead pre- 
dictions obtained by the SETAR-CPV model are superior to those 
generated by the AR-CPV model. 
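
The five measures listed above can be computed as in the following Python sketch. The paper does not spell out the formulas, so the definitions used here (errors taken as actual minus forecast, percentage errors taken relative to the actual value and not multiplied by 100) are common textbook conventions adopted as assumptions; the function name and the numbers in the example are hypothetical.

```python
import math


def forecast_accuracy(actual, forecast):
    """Return MSE, MAE, MSPE, RMSPE and MAPE for one-step-ahead forecasts.

    Relative errors are computed as (actual - forecast) / actual, so the
    actual values must all be non-zero.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("relative errors need non-zero actual values")
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]
    relative = [e / a for e, a in zip(errors, actual)]
    mspe = sum(r * r for r in relative) / n
    return {
        "MSE": sum(e * e for e in errors) / n,
        "MAE": sum(abs(e) for e in errors) / n,
        "MSPE": mspe,
        "RMSPE": math.sqrt(mspe),
        "MAPE": sum(abs(r) for r in relative) / n,
    }


# Made-up values, for illustration only.
measures = forecast_accuracy(actual=[1.0, 2.0], forecast=[0.5, 3.0])
print(measures)
```

Because growth rates can be very close to zero, measures based on relative errors may be dominated by a few observations, which is worth keeping in mind when comparing models on MSPE, RMSPE and MAPE.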

Furthermore, from an economic standpoint, a switching regime
model is more appealing since it identifies different models to
characterize the different states of the system. Hence, the final
model structure is consistent with the widely shared assumption
that, during recessions, industrial production evolves according
to a dynamic pattern substantially different from that followed
during growth phases.

Fig. 2. News impact curve for the estimated CPV-C(1,1) model (a) and estimated
conditional variance (b).

Table 6. Out-of-sample fitting performance of the SETAR-CPV model compared to the
AR-CPV model.

             MSE x 10^4   MAE       MSPE       RMSPE     MAPE
SETAR-CPV    0.31745      0.00392   7.68511    2.77220   1.46072
AR-CPV       0.32599      0.00408   16.56569   4.07009   1.84026

The one-month-ahead predictions for the period from January 1998
to December 2000 are shown in Figure 3
together with the estimated approximate confidence bands placed
at ±2 and ±3 conditional standard deviations.

Fig. 3. Observed data (solid) and one-month-ahead forecasts (dashed) with ±2,
±3 conditional s.d. approximate confidence bands for the estimated SETAR-CPV
model.

4 Concluding Remarks

A novel approach to modelling asymmetry in the conditional
mean and variance of the Industrial Production Index has been
proposed. The overall performance of the proposed model can
be considered satisfactory. Our approach makes it possible to detect and
estimate the impact of asymmetry on both the conditional mean
and variance of the series. Since asymmetry is introduced into
the model for the conditional variance via interactions between
past observations and volatilities, the effect of a negative shock
on the current volatility depends not only on the magnitude of
the shock but also on the economy’s uncertainty conditions as
measured in terms of past volatilities. In order to further investigate
the non-linear structure of the Industrial Production Index series,
we believe that in our future work it will be interesting to compare
the results presented here with those obtained by analysing a
wide range of series from other countries and, at the same time, to
consider other non-linear methods. Useful indications could also
be gained by looking at longer forecast horizons.
Abstract. Several methods have been proposed in the literature to evaluate Customer
Satisfaction (CS), often based on the gap concept. This paper sets out a strategy of
integrated analysis which, in the presence of several observations for the same statistical
unit, divides the gap between the customer’s quality expectation and quality perception
into further elements to be analyzed in order to improve service quality.



1 Introduction 

During the last twenty years firms’ strategy has gradually shifted
from marketing to Total Quality Management to Customer Satisfaction.
In particular, for a company, knowledge of the customer’s
evaluation of a given service represents an important starting
point for every business strategy. Several methods have been
proposed in the literature for service quality assessment. Many of
these models (Servqual, Normed Quality, Qualitometro etc.) are
based on the Gap Theory of Service Quality, developed for the
for-profit sector by Parasuraman et al. (1994). These models measure
the gap between customers’ expectations for excellence and their
perceptions of the actual service delivered, so that service providers can
understand both customer expectations and their perceptions of
specific services. SERVQUAL undertakes to measure service quality
across five dimensions: reliability, assurance, tangibles, empathy
and responsiveness. The aggregation of these measures creates
two problems: a) we have to cope with immaterial subjective
evaluations and with not fully comparable ordinal scales of
measurement which are treated as interval scales; b) an optimal
multidimensional statistical analysis has to be chosen for these data.
Several techniques have been proposed in the literature for the
description and the exploratory study of these data. The aims of
these proposals are to estimate the multidimensional aspects of the
investigated system and to introduce criteria of judgment
(Lauro et al. (1997); D’Ambra et al. (1999)
[D&al]; D’Ambra, Amenta (2000a); D’Ambra, Amenta (2000b)
[D&A]). A deeper analysis of student satisfaction can be carried out
if we have questionnaires taken in K different modules
or times (for example, during a post-degree training course with
different teachers): we have K pre-service matrices (one for each
module) and K post-service matrices (at the end of each module).

The aim of this paper is to propose a tensorial extension of Torre
and Chessel’s approach in order to analyse this complex
expectations-perceptions system (Full Multi Modules).

2 Tensorial Co-Structure Analysis 

In order to evaluate student satisfaction, we consider a questionnaire,
administered to n students, in K different modules during a
training course with K different teachers: this leads to K pre-service
and K post-service matrices, collected at the start and at the end
of each module. In the assessment of service performance, the
scale of measurement used is necessarily of ordinal type. This scale
involves a problem of comparability and, moreover, several techniques
of multidimensional data analysis treat this scale as
a quantitative one. To overcome this problem we can pre-code the
data by using one of the most popular techniques proposed in the
literature to transform an ordinal scale into an
interval one: the psychometric approach of Thurstone. Data are
represented in terms of matrices $X_k$ (pre-service) and $Y_k$ (post-service),
$(k = 1, \ldots, K)$, of order $(n \times p)$. Let $Q$ be a positive
definite symmetric matrix (metric) of dimensions $(p \times p)$ and
$D_n = diag(d_1, \ldots, d_n)$, $(\sum_{i=1}^{n} d_i = 1)$, the weight matrix of the
statistical units. For each $k$ we obtain couples of statistical triplets:
$(X_k, Q, D_n)$ and $(Y_k, Q, D_n)$. Following Lafosse (1989), every couple
can be defined as fully matched because its two tables describe the same
statistical units ($n$ rows) by means of the same variables ($p$ columns).
Let $\underline{X}$ and $\underline{Y}$ be three-way arrays, of order $(n \times p \times K)$, which
collect the matrices $X_k$ and $Y_k$, respectively. $\underline{X}$ and $\underline{Y}$ can
also be considered fully matched arrays. We transpose the arrays $\underline{X}$ and
$\underline{Y}$, keeping the first dimension fixed, so as to have $\underline{X}'$ and $\underline{Y}'$ of
order $(n \times K \times p)$. We then vectorise the two-way matrices $X_j$
of $\underline{X}'$ and $Y_j$ of $\underline{Y}'$, along the third dimension $j$ $(j = 1, \ldots, p)$:
${}_j x = Vec(X_j)$ and ${}_j y = Vec(Y_j)$, both belonging to the vector
space $\mathbb{R}^{n} \otimes \mathbb{R}^{K}$. We collect these vectors, by columns, in the
matrices $\mathbf{X} = [{}_1 x, \ldots, {}_p x]$ (see also Dynamic Factorial Analysis
(Coppi, Zannella, 1979)) and $\mathbf{Y} = [{}_1 y, \ldots, {}_p y]$ of order $(nK \times p)$,
respectively, and let $M = D_n \otimes I$, with $I$ the identity matrix.
We again obtain fully matched statistical triplets: $(\mathbf{X}, Q, M)$
and $(\mathbf{Y}, Q, M)$. Starting from this vectorisation and the
Co-Structure analysis (Torre, Chessel, 1995) [T&C], which can be
considered as a particular case of the Co-inertia analysis (Chessel,
Mercier, 1993 [C&M]; Tucker, 1958), we search for the co-structure
axis which maximises the covariance between the coordinates of
the row projections of the two matrices on this axis. The Tensorial
Co-Structure Analysis maximises the following quantity:

$$\max_a \langle \mathbf{X} Q a, \mathbf{Y} Q a \rangle_M \quad \text{or, equivalently,} \quad \max_a \sum_{k=1}^{K} \langle X_k Q a, Y_k Q a \rangle_{D_n}$$

where $a$ belongs to $\mathbb{R}^p$. This methodology simultaneously carries
out the analysis of inertia of each matrix and considers the
maximum of covariance between the coordinates of the two clouds on
the vector $a$ (Tab. 1). The maximum is reached for the first eigenvector
of the matrix $0.5 \, Q^{0.5} (\mathbf{X}' M \mathbf{Y} + \mathbf{Y}' M \mathbf{X}) Q^{0.5}$.
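
The eigenvector characterisation above can be sketched with NumPy as follows. The K tables are stacked module by module, which is one possible row ordering of the unfolded matrices; the ordering does not affect X'MY as long as the weights are repeated in the same order. The function names, the random data and the normalisation a'Qa = 1 (obtained by mapping the eigenvector b back through a = Q^{-1/2} b) are assumptions of this sketch.

```python
import numpy as np


def unfold(tables):
    """Stack K tables of shape (n, p) into one (n*K, p) matrix, module by module."""
    return np.vstack(tables)


def first_costructure_axis(X_tables, Y_tables, Q, d):
    """Return the first co-structure axis a (with a'Qa = 1) and its eigenvalue."""
    K = len(X_tables)
    X, Y = unfold(X_tables), unfold(Y_tables)
    m = np.tile(d, K)                        # diagonal of M: unit weights repeated per module
    w, V = np.linalg.eigh(Q)                 # Q is symmetric positive definite
    Q_half = (V * np.sqrt(w)) @ V.T          # symmetric square root of Q
    C = X.T @ (m[:, None] * Y)               # X' M Y
    S = 0.5 * Q_half @ (C + C.T) @ Q_half    # symmetrised operator
    vals, vecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    b = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    a = np.linalg.solve(Q_half, b)           # a = Q^{-1/2} b
    return a, vals[-1]


# Made-up data: n = 12 units, p = 3 variables, K = 2 modules.
rng = np.random.default_rng(1)
n, p, K = 12, 3, 2
X_demo = [rng.normal(size=(n, p)) for _ in range(K)]
Y_demo = [Xk + 0.3 * rng.normal(size=(n, p)) for Xk in X_demo]
A = rng.normal(size=(p, p))
Q_demo = A @ A.T + np.eye(p)
d_demo = np.full(n, 1.0 / n)

a_demo, lam_demo = first_costructure_axis(X_demo, Y_demo, Q_demo, d_demo)
print(a_demo, lam_demo)
```

The returned eigenvalue equals the value of the criterion at the optimal axis, that is the sum over the modules of the weighted covariances between the row coordinates of the pre-service and post-service tables.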



Table 1: Co-Inertia and Co-Structure of fully matched tables criteria

Reference             Criterion                                   Case                 Type
[C&M,93]              <X_1 Q a, Y_1 Q b>_{D_n}                                         Co-Inertia
[T&C,95] [D&al,99]    <X_1 Q a, Y_1 Q a>_{D_n}                    Single Module        Co-Structure
[D&A,2000b]           \sum_{k=1}^{K} <X_1 Q a, Y_k Q a>_{D_n}     Multi Modules        Co-Structure
Staticos              \sum_{k=1}^{K} <X_k Q a, Y_k Q a>_{D_n}     Full Multi Modules   Co-Structure

X_1 and Y_1 refer to the expectations and perceptions in the single module case.



The complete matching of the two matrices $\mathbf{X}$ and $\mathbf{Y}$ leads us to
consider the statistical units of the two tables as elements of a single
space and to develop, according to the gap theory, the analysis
of the triplet $(\mathbf{X} - \mathbf{Y}, Q, M)$. While the co-structure analysis
represents the core of resemblance between $\mathbf{X}$ and $\mathbf{Y}$, the analysis
of the triplet $(\mathbf{X} - \mathbf{Y}, Q, M)$ highlights how they differ.
This leads us to consider the eigenvectors associated with the
decomposition of the operator $(\mathbf{X} - \mathbf{Y})' M (\mathbf{X} - \mathbf{Y}) Q$ linked to the
maximum eigenvalue $\lambda$. Let $In_X$, $In_Y$ and $In_D$ be the inertias
associated with the analysis of the triplets $(\mathbf{X}, Q, M)$, $(\mathbf{Y}, Q, M)$
and $(\mathbf{X} - \mathbf{Y}, Q, M)$, respectively. The Co-Structure and Difference
Analysis of fully matched tables are linked by the relationship
$In_D = In_X + In_Y - 2\,tr(\mathbf{X}' M \mathbf{Y} Q)$: it decomposes the inertia
of the difference matrix into three components and it allows us to
analyse the customer satisfaction evaluation based on the gap and
to consider the single pre-service and post-service matrices and
their link. The co-structure and difference analysis can also be
performed if we vectorise the two-way matrices $X_k$ $(n \times p)$
of $\underline{X}$ and $Y_k$ of $\underline{Y}$, along the third dimension $k$: ${}_k x = Vec(X_k)$
and ${}_k y = Vec(Y_k) \in \mathbb{R}^{n} \otimes \mathbb{R}^{p}$. Let $\tilde{\mathbf{X}} = [{}_1 x, \ldots, {}_K x]$ and
$\tilde{\mathbf{Y}} = [{}_1 y, \ldots, {}_K y]$ be matrices of order $(np \times K)$, and let
now $M = D_n \otimes Q$. We obtain the fully matched statistical
triplets $(\tilde{\mathbf{X}}, I, M)$ and $(\tilde{\mathbf{Y}}, I, M)$, as well as the relationship
$In_{\tilde{D}} = In_{\tilde{X}} + In_{\tilde{Y}} - 2\,tr(\tilde{\mathbf{X}}' M \tilde{\mathbf{Y}})$ associated with the analysis of the
triplets $(\tilde{\mathbf{X}}, I, M)$, $(\tilde{\mathbf{Y}}, I, M)$ and $(\tilde{\mathbf{X}} - \tilde{\mathbf{Y}}, I, M)$, respectively. For
this approach, performed on $(\mathbf{X}, \mathbf{Y})$ and/or $(\tilde{\mathbf{X}}, \tilde{\mathbf{Y}})$, we have the
following properties or particular cases:

• The analysis of the triplets $(\mathbf{X}, Q, M)$ [$(\tilde{\mathbf{X}}, I, M)$] and $(\mathbf{Y}, Q, M)$
[$(\tilde{\mathbf{Y}}, I, M)$] represents the Interstructure phase of Statis (Lavit,
1988) developed on the pre-service matrices $X_j$ [$X_k$] and the
post-service matrices $Y_j$ [$Y_k$], respectively;

• Similarly, the analysis of the triplet $(\mathbf{X}, Q, M)$ [$(\mathbf{Y}, Q, M)$] is
also equivalent to achieving $\max_a a'(\sum_{k=1}^{K} X_k' D_n X_k Q) a$
[$\max_a a'(\sum_{k=1}^{K} Y_k' D_n Y_k Q) a$] with the constraint $a'Qa = 1$;

• The analysis of the triplet $(\tilde{\mathbf{Y}}, I, M)$ could also be considered as
a Multi-Modules version of SERVPERF (Cronin, Taylor, 1992)
with unitary weights for the dimensions and an equal number of
items for each dimension;

• The analysis of $\mathbf{X}' M \mathbf{Y} Q$ is the Tensorial Co-Structure Analysis,
which can also be considered as a Co-Structure approach
of STATICO (Hanafi, 1998) that we call STATICOS (for Statis
and Co-Structure);

• The analysis of the triplet $(\tilde{\mathbf{X}} - \tilde{\mathbf{Y}}, I, M)$ represents the
Interstructure phase of Statis developed on the gap matrices $X_k - Y_k$
(see also Lauro et al. (1997));

• The Tensorial Co-Structure Analysis, performed on $(\mathbf{X}, \mathbf{Y})$, can
be considered a generalization of the proposals of D’Ambra et
al. (1999) and D’Ambra, Amenta (2000b), which are particular
cases (Tab. 1).
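
The decomposition of the inertia of the difference matrix stated before the list of properties can be verified numerically. The following NumPy sketch computes the inertias as traces and checks the identity on random data; the helper names, the random tables and the choice of uniform weights are assumptions made for illustration only.

```python
import numpy as np


def inertia(Z, Q, m):
    """Inertia of the triplet (Z, Q, M) with M = diag(m): tr(Z' M Z Q)."""
    return np.trace(Z.T @ (m[:, None] * Z) @ Q)


def costructure_trace(X, Y, Q, m):
    """Link term between two fully matched tables: tr(X' M Y Q)."""
    return np.trace(X.T @ (m[:, None] * Y) @ Q)


# Made-up unfolded tables with n*K = 24 rows and p = 4 columns.
rng = np.random.default_rng(0)
rows, p = 24, 4
X_u = rng.normal(size=(rows, p))
Y_u = rng.normal(size=(rows, p))
B = rng.normal(size=(p, p))
Q_u = B @ B.T + np.eye(p)          # symmetric positive definite metric
m_u = np.full(rows, 1.0 / rows)    # diagonal of M, uniform weights assumed

in_x = inertia(X_u, Q_u, m_u)
in_y = inertia(Y_u, Q_u, m_u)
in_d = inertia(X_u - Y_u, Q_u, m_u)
link = costructure_trace(X_u, Y_u, Q_u, m_u)

print(np.isclose(in_d, in_x + in_y - 2.0 * link))
```

The identity relies on Q and M being symmetric, so that the two cross terms tr(X'MYQ) and tr(Y'MXQ) coincide.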

3 An Example of “Full Multi-Modules”
Customer Satisfaction Evaluation

Services can usually be divided into four types:
1) Interface service, where the supply provides an interaction between
supplier and client/user (e.g. training service); 2) Data processing
service, where the supply provides a real good (e.g. word processor,
research program); 3) Availability service, which puts several objects at
someone’s disposal (e.g. car rental); 4) Process service, where the
availability of the good is guaranteed by a process control (e.g. electric power
supply). In this example we consider the assessment of an interface
service: the student satisfaction evaluation. Regarding the
survey method, there are two alternatives: the first
(direct method) involves the student in the evaluation process;
the second (indirect method) uses both the information
given by the contact personnel through their relationship with
the student and possible complaints. We choose the first method and
administer a questionnaire to the students in order to
obtain the information necessary for the assessment. The questionnaire
takes into account all the characteristics that influence
student satisfaction. These characteristics have been organised into two areas:
Teaching and General Organisation.

In order to minimise the level of intrusion in the survey, the
expectations and the perceptions of the students are collected at the
start and at the end of each module. The questionnaire has
been given to 69 students. We have:

- 3 expectation matrices (“Exp Mod1”, “Exp Mod2” and “Exp
Mod3”)

- 3 perception matrices (“Perc Mod1”, “Perc Mod2” and “Perc
Mod3”).

The 20 questions are about: Accessibility, Parking, Rooms, Teaching
tools, Timetable, Length of course, Course supply capability,
General Information, Technical Service, Nursing (General Organisation
group); Teaching books utility, Teaching equipment clearness,
Lessons, Teacher availability, Practical work, Acquired ability,
Teacher ability, Teacher competence, Teacher involvement,
Teacher kindness (Teaching group). Due to the nature of the
variables (expressed on an ordinal scale) the raw scores have been
transformed by the Thurstone normal model in order to replace
the original ordinal data with quantitative data. We develop both
a Tensorial Co-structure analysis (STATICOS) and a Difference
analysis of “Multi-Modules” Fully Matched Tables regarding the
student satisfaction evaluation of the course.

A co-structure test (row permutations) has been performed
on the data, yielding a significant value for the co-structure
between pre- and post-service matrices.
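
A row-permutation test of this kind can be sketched as follows. The paper does not report the test statistic or the permutation scheme, so the sketch assumes uniform weights, uses the total co-structure tr(X'MYQ) as the statistic and applies the same random permutation of the students to all post-service tables; the function names and the simulated data are hypothetical.

```python
import numpy as np


def costructure_stat(X_tables, Y_tables, Q):
    """Total co-structure sum_k tr(X_k' D_n Y_k Q) with uniform weights 1/n."""
    n = X_tables[0].shape[0]
    return sum(np.trace(Xk.T @ Yk @ Q) for Xk, Yk in zip(X_tables, Y_tables)) / n


def permutation_test(X_tables, Y_tables, Q, n_perm=199, seed=0):
    """Return the observed statistic and a one-sided permutation p-value."""
    rng = np.random.default_rng(seed)
    n = X_tables[0].shape[0]
    observed = costructure_stat(X_tables, Y_tables, Q)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(n)            # same reshuffling of students in every module
        permuted = [Yk[perm] for Yk in Y_tables]
        if costructure_stat(X_tables, permuted, Q) >= observed:
            count += 1
    return observed, (count + 1) / (n_perm + 1)


# Simulated example: perceptions built as expectations plus noise.
rng = np.random.default_rng(42)
n, p, K = 30, 4, 3
X_sim = [rng.normal(size=(n, p)) for _ in range(K)]
Y_sim = [Xk + 0.2 * rng.normal(size=(n, p)) for Xk in X_sim]
stat_sim, p_value_sim = permutation_test(X_sim, Y_sim, np.eye(p))
print(stat_sim, p_value_sim)
```

A small p-value indicates that the observed link between the pre-service and post-service tables is larger than what would be expected if the rows of the two sets of tables were matched at random.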




Fig. 1 . The first axis of Staticos. 



In Figure 1 we have the projections of the global scores of
the expectation and perception characteristics on the first axis of
Staticos. It seems that, in the second area, the values of perceptions
often exceed the expectations, with a neutral opposition in
the first area: we have, in this sense, a neutral satisfaction for the
general organisation as well as a good satisfaction in the teaching
area.



Fig. 2. (a) Radar graphic of the dimensions, (b) Interstructure phase of Statis
performed to analyse the Multi-Modules version of Servperf.



Figure 2(a) represents the radar graphic of the coordinates
of the SERVQUAL dimensions on the first axis of STATICOS
performed on the matrices X and Y unfolded in raw-linked
variables of each dimension. We highlight that the global perceptions
of each dimension are equal to or greater than the expectations.
For the same analysis we represent the first factorial plane of the
perceptions (Fig. 2(b)). This figure is equivalent to the representation
of the interstructure phase of Statis performed to analyse
the Multi-Modules version of Servperf with unitary weights for
the dimensions.

Developing the projections of the expectations and the perceptions,
for each module, for the statistical units on the first component
of STATICOS, we obtain several diagrams (for example, here only
the first 19 students), where we can read each “Multi-Modules”
expectation and perception (Figure 3). We remark, in Figure 3(a)
(upper left), the increasing evaluations, about the expectations,
for students 18, 5, 6 and 12 as well as the decreasing evaluations
for students 1, 9 and 15. Regarding the perceptions, in
Figure 3(b) (upper right), students 12 and 18 present increasing
levels of perceptions while the first and the third students
show the contrary. An overall representation of the multi-module
expectations and perceptions can be found in Figure 3(c).

Fig. 3. Projections of the statistical units on the first axis of STATICOS.

Figure 4 represents the projections of the expectations,
perceptions and gaps for each module on the first factorial plane,
respectively. In this case it is evident that for the expectations and
perceptions (Figure 4(a) and Figure 4(b), respectively), the evaluations
for modules 1 and 2 are linked while the evaluations
expressed for module 3 are different, highlighting an overall
different behaviour. A similar representation can be performed for
the multi-module gaps. Other information about global gaps can
be obtained from the representation of the statistical units with
STATICOS, here for the first 2 modules, in Figure 5(a) (left) and
5(b) (right) respectively. The starting point of the arrow represents
the position of the statistical unit before the service has
been experienced, while the end of the arrow highlights the position
after the service has been experienced. The length of the arrow
represents the gap.

Fig. 4. (a) Expectations: first factorial plane, (b) Perceptions: first factorial plane.
Abstract. In recent years wholesale trade has changed owing to the continuous
redefinition of the wholesale function within the chain running from production to
final distribution. Using a joint application of Principal Component Analysis and Cluster
Analysis, a classification of wholesale markets is proposed, as an alternative to
the formal classification of NACE Rev. 1, based both on the main structural
variables of wholesalers (size, local units, location, legal form) and on
variables relating to economic accounts (turnover, value added, marginal
enterprises) and export activity. Some critical points arise about the use of official
classifications, as well as several analytical points for future studies.



1 Premise 

The aim of this paper is to provide a classification of wholesale
trade enterprises based on structural and performance indicators
in order to highlight some peculiar aspects of the wholesale markets
in Italy that the traditional classification of economic activities
does not manage to bring out. In particular, Principal Components
Analysis (PCA) and Cluster Analysis (CA) have been
performed and some different typologies among wholesale markets
have been identified. From the point of view of economic analysis,
the wholesale trade sector is traditionally characterised as an
extremely complex one (see Consorti (1995) and Lugli (1988)),
because of its role as a link between production and retail distribution,
with which it shares both synergies and growing competition.
In recent years, in Italy, structural changes in retail trade
and competition in manufacturing have tended to vertically integrate
the wholesale trade function, to the point that the volumes of goods
traded by traditional wholesalers in some markets have drastically
decreased over time. Nevertheless, even in the presence of marginal
enterprises, the sector did not face any dramatic reduction in the
number of operators or in the number of persons employed as
happened, for instance, in retailing. According to several authors
(see Linkert (1998) and Lugli (1998)) wholesalers are transforming
into service providers and this characteristic, even if stronger
in certain branches, has been shown to be cross-sectional regardless of
the size, localisation and specialisation of enterprises. Unfortunately,
the traditional tools of analysis, used also by official
statistics, do not seem to give evidence of that and confine the
analysis to a picture constrained by official classifications, in
which wholesale trade seems to stand alone in the economy and its
importance is usually underestimated¹. In this framework, for a
long time, the most common description of the sector has been in
terms of consistencies and comparisons with retailing, with the accent
on the traditional contrast between a specialised and market-oriented
North and a backward South of the country (see ISTAT
(1998) and Zaghi (1996)). Since it does not seem fully pertinent to insist
on a frame in which production-oriented wholesalers of bigger
size and better performance belong to northern markets whilst
consumption-oriented wholesalers of small size and modest performance
characterise southern markets, the approach proposed
by the authors aims at finding new perspectives of analysis by
means of some methodological tools left unexplored in this field.

(1) Istat has recently produced interesting insights for a critical
analysis of business structure and classification. See ISTAT (1999)
ch. 3 and ISTAT (2000) ch. 3.

2 Structural Aspects of Wholesale Trade in Italy

Table 1 shows the results of the 1996 ISTAT Census of Industry and
Services and gives a first view of the sector. Wholesalers
represent one out of ten Industry and Services firms, one out of
seven Services firms and one out of three Distributive Trade firms.
The number of persons employed is almost one million. Firms
operating on a fee or contract basis represent more than 60% of
total wholesale (2). One third of the remaining firms operate in
non-food consumption goods trade, while one fourth operate in food
and beverage trading. Intermediate goods involve one fifth of
wholesale firms, while trading in investment goods concerns one
tenth of total enterprises. Northern regions account for more than
one half of wholesalers and more than 60% of total employment,
mostly concentrated in intermediate and investment goods. The
average size of Italian wholesalers is 4.7 persons employed, but
some remarkable differences emerge in relation to geographical and
sector characteristics: central and southern wholesalers show
smaller average sizes, while intermediate and investment goods
wholesalers are generally larger. This picture can be enriched with
economic information such as turnover, value added, etc. These data
already show some meaningful structural characteristics of
wholesale sectors. This notwithstanding, strong differences among
activities still remain inside the three-digit Nace Rev.1
breakdown, relating both to the specific structural characteristics
of the upstream industry and downstream retail trade sectors and to
the characteristics of products in terms of storage,
transportation, etc.: wholesale activities concerning fresh fruit
and vegetables, meat, fish, beverages, etc., all included in the
51.3 Nace Rev.1 group, are an example of this. More interesting
results can be obtained through the sequential application of
standard factorial methods and cluster analysis.
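
The shares quoted in this section can be re-derived from the national
totals of Table 1. The short Python sketch below does so; the mapping of
the goods categories named in the text onto Nace Rev.1 groups
(`GOODS_CATEGORY`) is our assumption for illustration and is not stated
in the paper. On these figures, the average size of 4.7 persons employed
is obtained only when group 51.1 is left out, which is consistent with
the exclusion described in footnote 2.

```python
# Figures copied from Table 1 (1996 ISTAT Census of Industry and Services):
# national totals by 3-digit Nace Rev.1 group.
ENTERPRISES = {
    "51.1": 221328, "51.2": 8893, "51.3": 37077, "51.4": 46198,
    "51.5": 29274, "51.6": 14892, "51.7": 9152,
}
PERSONS_EMPLOYED = {
    "51.1": 281357, "51.2": 26641, "51.3": 168450, "51.4": 214159,
    "51.5": 155046, "51.6": 84684, "51.7": 37200,
}

# ASSUMPTION: correspondence between the goods categories named in the
# text and the Nace groups; the paper does not spell this mapping out.
GOODS_CATEGORY = {
    "non-food consumption goods": "51.4",
    "food and beverages": "51.3",
    "intermediate goods": "51.5",
    "investment goods": "51.6",
}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


total_firms = sum(ENTERPRISES.values())
fee_share = percent(ENTERPRISES["51.1"], total_firms)

# Group 51.1 (trade on a fee or contract basis) is left out from here on.
firms_excl_fee = total_firms - ENTERPRISES["51.1"]
persons_excl_fee = sum(PERSONS_EMPLOYED.values()) - PERSONS_EMPLOYED["51.1"]
average_size = persons_excl_fee / firms_excl_fee

category_shares = {
    name: percent(ENTERPRISES[group], firms_excl_fee)
    for name, group in GOODS_CATEGORY.items()
}

if __name__ == "__main__":
    print(f"fee or contract basis: {fee_share:.1f}% of all wholesale firms")
    print(f"average size excluding 51.1: {average_size:.1f} persons employed")
    for name, value in category_shares.items():
        print(f"{name}: {value:.1f}% of the remaining firms")
```

Under the assumed mapping, the four category shares come out close to
one third, one fourth, one fifth and one tenth of the remaining firms,
in line with the proportions given in the text.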

3 Data Analysis Applications 

Using ISTAT data from structural business surveys and the industry
and services census, referring to the year 1996, 16 structural
indicators and 7 performance indicators were calculated for each of
the 176 domains of study (3). The PCA carried out on the 23
standardised variables (60% of total inertia explained by the first
two components) pointed out the weakness of the bivariate
correlation between enterprise size and performance (see Table 2),
giving evidence of the fact that the presence of large enterprises
in a domain does not imply a high performance profile in terms of
profit margin or turnover per person employed. The first principal
axis, identified as "enterprise structure", is in fact positively
correlated with structural indicators, whilst the second
("enterprise performance") is correlated with performance
indicators. PCA thus supported the hypothesis that different
characteristics of wholesale trade coexist even within the same
branches, each defined by the way structural and performance
variables are combined. Figure 1, referring to Nace Rev.1 groups,
gives evidence of this, since the individuals (i.e. the domains)
corresponding to different combinations of structure and
performance are spread over the whole graph.

Through the application of CA algorithms (this paper reports the
results of the Ward method performed using all the components
extracted by PCA (4)), two partitions have been examined: one
(explaining 50.2% of total inertia) consisting of four clusters of
domains, and the other, derived from the former and explaining 64%
of total inertia, consisting of eight clusters, mainly subsets of
the main groups identified by the first partition. The four-cluster
partition made it possible to draw a preliminary map of wholesale
activities, highlighting the relations between structural
characteristics, commodity specialisation and localisation. The
eight-cluster partition made it possible to sharpen the focus of
the previous analysis: in particular, it has been possible to
establish the presence of a multitude of wholesale trade typologies
even within similar markets. The first cluster gathers the domains
of traditional wholesale trade, strongly characterised by
small-medium sized, mainly unincorporated firms, highly localised
and with modest profit margins, trading in food and beverages
(mainly fruit and vegetables), tobacco and agricultural raw
materials; a large part of southern wholesale trade employees is
also represented here. Within this group, however, in the frame of
the eight-cluster partition, it is possible to isolate as a
homogeneous subset the activities of southern wholesale trade of
fresh fruit and vegetables, characterised by a strong presence of
small firms, with a higher propensity to export but small profit
margins.

(2) The group has been excluded from the analysis proposed in this
paper, since the peculiar nature of its activity cannot be defined
as wholesale trade in a proper sense (Zaghi (1996)). From now on,
the term "wholesale firms" will therefore refer to all wholesalers
except those operating in this group.

(3) The 176 domains were obtained by crossing 44 Nace Rev.1
classes, corresponding to wholesale trade activities, with 4 main
geographical areas, i.e. Northeast, Northwest, Centre and South.
The structural indicators were: size in terms of persons employed
per firm and per local unit, concentration indicators (general, per
local unit, by size and by legal form), localisation indexes in
terms of firms, local units and employees, average number of local
units, and shares of firms with multiple local units. The
performance indicators were: turnover, value added, export share of
turnover, gross margin and investment per capita by two size
classes, and profit ratio.

(4) The use of a limited number of the extracted factors (those
with eigenvalues higher than 1) was also tried, as were K-means
approaches. In both cases, substantially similar results were
obtained.

Table 1. Enterprises and persons employed in wholesale trade broken down by
3-digit Nace Rev.1.

Nace    Description                          North-   North-   Centre    South    Total
Rev.1                                          west     east

N. of enterprises
51.1    Trade on a fee or contract basis      65209    53833    48913    53373   221328
51.2    Agricultural raw materials and         2589     2223     1571     2510     8893
        live animals
51.3    Food, beverages and tobacco            8542     6926     6594    15015    37077
51.4    Household goods                       14990     9455     9386    12367    46198
51.5    Non-agricultural intermediate         10683     6900     5365     6326    29274
        products, waste and scrap
51.6    Machinery, equipment and               5785     3831     2487     2789    14892
        supplies
51.7    Other wholesale                        3920     1862     1517     1853     9152
        TOTAL                                111718    85030    75833    94233   366814
        % on DISTRIBUTIVE TRADE                34.9     36.7     30.1     22.3     29.9
        % on SERVICES                          15.5     16.5     14.3     12.5     14.5
        % on INDUSTRY AND SERVICES             10.8     11.3     10.3      9.5     10.4

N. of persons employed
51.1    Trade on a fee or contract basis      85814    69407    61865    64271   281357
51.2    Agricultural raw materials and         8808     7755     4400     5678    26641
        live animals
51.3    Food, beverages and tobacco           45801    44791    29242    48616   168450
51.4    Household goods                       88982    47507    39517    38153   214159
51.5    Non-agricultural intermediate         64234    41320    26143    23349   155046
        products, waste and scrap
51.6    Machinery, equipment and              45849    20016    10914     7905    84684
        supplies
51.7    Other wholesale                       20400     7644     4710     4446    37200
        TOTAL                                359888   238440   176791   192418   967537
        % on DISTRIBUTIVE TRADE                37.7     35.6     29.9     25.2     32.5
        % on SERVICES                          15.5     15.1      9.2     12.5     13.2
        % on INDUSTRY AND SERVICES              7.4      7.3      5.6      7.6      7.0

Source: 1996 ISTAT Census of Industry and Services

Table 2. Description of the first two factors of PCA (1)

Factor   % Inertia
1             45.8
2             13.2
3              9.9
4              6.5
5              5.3

Factor 1 variables                                          Correlation
Enterprises with > 20 persons employed (share)                     0.93
N. of persons employed                                             0.91
N. of persons employed in local unit                               0.89
Persons employed in firms with > 20 persons employed (share)       0.88
Inc. enterprises with > 20 persons employed                        0.85
N. of local unit per enterprise                                    0.76
Persons employed in multi-localized enterprises (share)            0.72
Inc. firms with i 20 pers. employed (share)                        0.69
Pers. employed in inc. firms with > 20 pers. employed              0.67
  (share on tot inc. firms)

Factor 2 variables                                          Correlation
Investment per person employed                                    -0.70
Turnover per person employed                                      -0.71
Gross operative margin / value added                              -0.71
Value added per pers. employed                                    -0.73

(1) For brevity, only factors with eigenvalues higher than 0.6 and variables with the
absolute value of the correlation coefficients higher than 0.7 are shown in the table.

The second cluster aggregates more organised traditional wholesale
activities. Within this group meaningful differences among domains
still remain, but it is possible to split it into three subgroups.
One is composed of small-medium sized food wholesale activities,
with modest profit margins and a moderate demographic trend.
Another subgroup is composed of the domains of non-food wholesalers
of the Central and Northern regions, with a medium average size, a
higher propensity to export and a more capital-intensive structure.
In the last subgroup firm size is larger, incorporated firms are
more numerous, and profit margins and the other economic indicators
show better values: these are mainly northern domains, dealing with
durable and intermediate goods wholesale.

Table 3. Synoptic table of wholesale trade (-- = no code shown)

3-digit group / 4-digit Nace Rev.1 class                    N-W  N-E  C.   S.

Agricultural raw materials and live animals
  Grain, seeds and animal feeds                             --   2C   --   --
  Flowers and plants                                        1A   --   --   --
  Live animals                                              --   --   --   --
  Hides, skins and leather                                  --   2B   --   --
  Raw tobacco                                               --   --   2B   --

Food, beverages and tobacco
  Fruit and vegetables                                      2A   --   2A   1A
  Meat and meat products                                    2A   2A   2A   2A
  Dairy produce, eggs and edible oils and fats              2A   2A   --   --
  Alcoholic and other beverages                             --   2A   --   --
  Tobacco products                                          --   --   --   --
  Sugar and chocolate and sugar confectionery               --   --   --   --
  Coffee, tea, cocoa and spices                             --   2C   --   --
  Non-specialised food, beverages and tobacco               4B   4A   4B   2A
  Other food, including fish, crustaceans and molluscs      2C   4B   2B   2A

Household goods
  Textiles                                                  2B   2B   2B   --
  Clothing and footwear                                     2B   2B   2B   --
  Electrical household appliances and radio and television  4A   4A   2B   2B
  China and glassware, wallpaper and cleaning materials     2C   2B   2B   --
  Perfume and cosmetics                                     2C   2B   2B   --
  Pharmaceutical goods                                      4A   4A   4A   --
  Other household goods                                     2B   2B   2B   --

Non-agricultural intermediate products, waste and scrap
  Solid, liquid and gaseous fuels and related products      --   2C   --   2B
  Metals and metal ores                                     4A   4A   4A   2B
  Wood, construction materials and sanitary equipment       2B   2B   2B   --
  Hardware, plumbing and heating equipment and supplies     2C   4A   2C   2B
  Chemical products                                         2C   2C   2C   --
  Other intermediate products                               2B   2B   2B   --
  Waste and scrap                                           --   --   --   --

Machinery, equipment and supplies
  Machine-tools                                             2C   2B   --   --
  Construction machinery                                    2C   2B   2B   --
  Machinery for the textile industry                        2C   2B   2B   --
  Office machinery and equipment                            4B   2B   2B   --
  Other machinery for use in industry, trade and navigation 2C   2B   2B   --
  Agricultural machinery and accessories and implements     2B   2B   2B   --

Other wholesale                                             2B   2B   --   --

The third cluster is composed of fuel wholesalers, highly
specialised, with considerably high profit margins and with a
stronger presence of large firms. This is a very peculiar cluster,
which cannot be split further owing to the extreme homogeneity of
the firms in terms of both structure and performance indicators.
Finally, the fourth cluster can be described as being composed of
the most advanced wholesale activities. It can be divided into two
subgroups. The first is composed of the best performing wholesale
branches of the Northern regions, trading in pharmaceuticals,
durable goods and metals.

The second is composed of the non-specialised food wholesalers,
which are mainly large capital-intensive firms, with a high
propensity to export and lower profit margins. Table 3 summarises
the results of the two partitions, giving evidence of the
coexistence of different typologies in each class of the Nace
Rev.1. Wholesale of food, beverages and tobacco shows interesting
specific characteristics, explainable on the basis of both regional
and sector differences. In the partitions these activities are
split between the first two groups. Some domains, belonging mainly
to the northern regions, are characterised by more modern business
structures, and it is likely, in the authors' view, that this
reflects the influence of stronger competition due to the presence
of larger retail trade firms more oriented toward some sort of
"self-wholesaling" activity (see Confcommercio, 1999). In southern
regions, where firm size is generally smaller, the wholesale trade
of fruit and vegetables represents an extremely vital sector with a
strong export activity: in northern regions, these same activities
are grouped together with durable and intermediate goods wholesale
activities. Meat wholesale is instead represented by very
homogeneous domains in group two. The wholesale of beverages,
tobacco, sugar, coffee, etc., is characterised by a small average
firm size, with the only exception of a few domains in
north-eastern regions.

Wholesale of household goods shows a sharp divide between southern
regions and the rest of Italy. While in the south firms are
generally small and poorly structured, the rest of the domains are
characterised by two sets of structures: on one side, the more
advanced medium firm domains, with higher profit margins and better
values of the other economic indicators; on the other, the small
firm domains with a more traditional organisation but nevertheless
characterised by a strong presence of incorporated firms and by
good profit margins and value added per capita.

Wholesale of non-agricultural intermediate products, waste and
scrap presents strong differences with regard to business structure
and performance. Both territorial and product effects influence
these differences. Thus part of the domains (especially southern
ones) are grouped among the more traditional and small-sized
wholesale activities, others are grouped among the small-medium
sized domains with good profit margins, and the rest present the
structural characteristics of the more evolved wholesale
activities.

Geographical factors explain most of the differences within the
wholesale of machinery, equipment and supplies. In the south, small
firm domains prevail. In the other regions, wholesale activities
are centred on small and medium firms, mostly characterised by good
profit margins. Larger firms are present in the northern regions'
domains. The remaining wholesale activities belong to substantially
homogeneous Nace Rev.1 groups.
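
The clustering step used in this section, and the "share of total
inertia explained" quoted for each partition, can be sketched in a few
lines of Python. This is a minimal illustration under stated
assumptions, not the authors' implementation: the data in `TOY_DOMAINS`
are invented, and the PCA step is omitted because retaining all the
principal components of the standardised variables amounts to an
orthogonal rotation, which leaves the Euclidean distances used by Ward's
criterion unchanged. The function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev

# HYPOTHETICAL data: six domains described by two indicators
# (e.g. average firm size, value added per person employed).
TOY_DOMAINS = [
    [2.1, 30.0], [2.4, 28.0], [2.0, 33.0],
    [9.5, 60.0], [10.2, 58.0], [9.8, 65.0],
]


def standardise(rows):
    """Centre each column and scale it to unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [mean(coords) for coords in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_partition(points, k):
    """Merge singletons until k clusters remain, joining at each step the
    pair whose union gives the smallest increase in within-cluster inertia
    (Ward's criterion)."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        def increase(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            weight = len(a) * len(b) / (len(a) + len(b))
            return weight * sq_dist(centroid(a), centroid(b))

        i, j = min(combinations(range(len(clusters)), 2), key=increase)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe: j > i, so index i is unaffected
    return clusters


def explained_inertia(clusters):
    """Between-cluster inertia as a share of total inertia."""
    everyone = [p for cluster in clusters for p in cluster]
    g = centroid(everyone)
    total = sum(sq_dist(p, g) for p in everyone)
    between = sum(len(c) * sq_dist(centroid(c), g) for c in clusters)
    return between / total


if __name__ == "__main__":
    z = standardise(TOY_DOMAINS)
    for k in (2, 3):
        share = explained_inertia(ward_partition(z, k))
        print(f"{k} clusters explain {100 * share:.1f}% of total inertia")
```

Cutting the same hierarchy at a larger number of clusters can only
raise the explained share, which is why the eight-cluster partition
reported above explains more inertia than the four-cluster one.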

Table 4. Synoptic table of wholesale trade - Legend

Clusters        Description                      CODE   Characteristics

First cluster   Small size traditional           White  Micro enterprises, low
                wholesalers                             performance
Subgroup 1A     Fruit and vegetables localised   1A     Very localised, good
                mainly in the South                     performance, high share of
                                                        exports on turnover
Second cluster  Traditional wholesalers in              Small-medium size
                consolidation                           enterprises
Subgroup 2A     Food wholesalers mainly          2A     No local units, modest
                localised in the North and              profits, low export
                Centre
Subgroup 2B     Medium size non-food             2B     Good profitability, low
                wholesalers                             turnover per pers. employed
Subgroup 2C     Wholesalers of durable and       2C     Medium-big sized, localised,
                intermediate goods mainly               medium-high profits, high
                localised in the North                  turnover per pers. employed
Third cluster   Wholesalers of fuel and                 Big sized, very high profit
                related products                        margins, very good
                                                        performance
Fourth cluster  Advanced wholesalers                    Big size, several local units
Subgroup 4A     Highly advanced wholesalers      4A     Very high profit margins,
                in the North                            very localised, incorporates
                                                        high value added
Subgroup 4B     Non-specialised food             4B     Low profit margins, low value
                wholesalers                             added per pers. employed

4 Conclusions

PCA and CA proved to be very useful methodological tools in
analysing large economic data sets, although they are not
systematically present in the economist's tool-box. In our case,
they helped us to show how difficult it is to base an up-to-date
and coherent analysis of the sector only on the traditional
approaches, mainly based on the Nace Rev.1 classification of
economic activity. The results highlight the dramatic heterogeneity
of the wholesale trade sectors, and the coexistence of different
trading patterns even within similar markets and products. In
particular, they help to make clear the importance of the economic
relations between the wholesale activities and the corresponding
upstream (mainly industrial) and downstream (mainly distributive
trade) sectors. Important insights have also been offered for the
stratification and sampling criteria to be used in business surveys
dedicated to these sectors. This work nevertheless represents only
a first step toward a deeper comprehension of the business
structure of wholesale trade activities. Three further developments
seem to us extremely fruitful, of both a methodological and an
economic nature. First, significant improvements can be obtained
through a refinement of the PCA and CA methods we applied,
especially in the light of recent improvements in the joint
application of these methods. Secondly, it is necessary to consider
explicitly the upstream and downstream business structures, linking
them with the wholesale sectors' structure and thereby building
more comprehensive and general explanations. Thirdly, the analysis
of data for subsequent years seems quite promising, in order to
highlight the relationship between structural changes and the
results of cluster analysis.

Abstract. The commonly used criterion that sharply separates the poor from the
non-poor on the basis of a poverty threshold appears too severe in comparison with
the nature of poverty. The latter is multidimensional in its components (domain)
and continuous in its states (codomain). Moreover, an income-based poverty line
allows for a remarkable number of spurious transitions below and above that line,
which do not correspond to true variations in a household's standard of living. This
study starts from the analysis of common transition matrices (with crisp states);
then a fuzzy multidimensional poverty indicator is built. In conclusion, fuzzy-state
transition matrices synthesize the interpretative content of previously proposed
instruments for the analysis of poverty.



1 Introduction 

Poverty could seem a clear and sharp concept if we understand it as
the opposite of economic wealth. Nevertheless, in the modern
approach to the measurement of poverty, this concept comprises
several aspects which also involve the social dimension, among
which: social marginality, poor participation in political life,
dissatisfaction with one's own role in society, inadequate housing,
low education, and difficulties in transforming resources into
“functionings” (according to Sen (1985) and (1992)). Do two persons
with the same level of expenditure have the same well-being?
Probably not, if they have different patterns of consumption: for
example, we cannot evaluate in the same way expenditure on medical
care and on holidays. Indeed, we could ask whether two persons
(having the same level of income and social opportunities) could
have different capabilities in transforming economic resources into
well-being because of different levels of education or personal
ability (see Sen (1992)), or whether they could have different
perceptions of their own status. Moreover, we could ask whether
equidistribution of economic resources means absence of poverty, or
whether poverty has to be fought because it produces social
injustice or because it damages economic development. Only rarely
do researchers make explicit the theoretical assumptions underlying
their hypotheses and choices about the relevant dimensions of the
concept of “poverty”. We maintain, in fact, that in analyzing this
phenomenon one assumes, implicitly or explicitly, an economic
paradigm (to express the economic behavior of the agents and to
make explicit whether well-being is intended as utility, social
welfare or something else), a political paradigm (in which it must
be made explicit what “equity” is: equidistribution, protection of
weak categories, democracy, political participation, ..., and which
of these aims should be pursued, see Hagenaars (1986)), a
sociological theory to interpret the effect of poverty on social
stratification (Abel Smith (1984)) and a psychological theory to
detect the subjective perception of one's own individual condition
(role) with respect to society (Hagenaars (1986) and Van Praag
(1991)). Each of these paradigms has in fact a relevant impact on
the choice of the components that are to be included in the concept
of poverty, and on the weights we must give them. Although
arbitrariness is intrinsic to applied sciences, we think that it is
not a limit but a stimulus towards new specifications and
developments in the object of research. For further reflections on
the implications of different approaches to the study of poverty
and their consequences for logical, methodological and
interpretative aspects, see Mendola (2002).

* Although this paper is the result of joint work by the Authors,
sections 1 and 3 are attributable to S. De Cantis; sections 2, 4
and 5 to D. Mendola.

2 The Approaches and the Definitions of 
“Poverty” 

Classifying a unit (individual or household) as poor is a complex
procedure, even once the variables representing poverty status have
been chosen. In the literature there are three general approaches
to defining poverty, known as absolute, relative and subjective. In
the absolute approach poverty is intended as the condition in which
households are not able to reach a minimum level of objective
welfare (however we intend it) that is judged socially acceptable.
The basic needs approach is a way to make this concept operational.
The first step consists in the selection of a set (“basic set”) of
goods and services that every household should possess in order to
reach an acceptable standard of living; this set differs according
to the characteristics of the household members (such as age, kind
of job, ...). In a second step, we express this set by its monetary
value, and finally we compare this monetary threshold with
household income or expenditure. If the income received by the
household is lower than or equal to the threshold, the household is
classified as “poor”. We refer to a relative approach when poverty
is intended as an objective difference from some average standard
of living in the society. The official approach of the ISPL
(International Standard of Poverty Line), adopted in most
industrial countries, is the most common application of the
relative approach. It defines a two-member household as “poor” if
its income (or expenditure) is equal to or lower than the mean
national per capita income (expenditure). The mean national per
capita income (that is, the “poverty threshold”) is modified in
order to account for household sizes different from two by using
the coefficients of an equivalence scale, which take account of the
economies of scale of cohabitation. Therefore, the ISPL is a
unidimensional criterion based on income (or expenditure) as the
unique determinant of a household's welfare. This definition does
not involve social deprivation, or it considers the latter as
included in (and well represented by) the insufficiency of income.
More recently, the definition of subjective poverty has emerged:
the perception that people have of their own situation, as an
individual feeling. For lack of information about households'
perception of their own poverty, it is necessary to prepare
suitable measurement instruments (e.g. specific questionnaires).
The complexity and variety of the information necessary for a
subjective measure of poverty make this approach applicable only to
a very limited extent. Interesting attempts were made in the Leiden
Income Evaluation Project (Hagenaars and Van Praag (1985)).
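
The ISPL rule described above (a two-member household is poor if its
income does not exceed the mean national per capita income, with the
threshold rescaled for other household sizes) can be written as a short
Python sketch. The coefficients in `EQUIVALENCE` are illustrative
placeholders assumed for this example and are not taken from the text;
the function names are likewise hypothetical.

```python
# ASSUMPTION: illustrative equivalence coefficients by household size,
# normalised so that a two-member household has coefficient 1.00.
# Households with seven or more members share the last coefficient.
EQUIVALENCE = {1: 0.60, 2: 1.00, 3: 1.33, 4: 1.63, 5: 1.90, 6: 2.16, 7: 2.40}


def poverty_line(mean_per_capita_income, household_size):
    """Threshold for a household of the given size: the two-member line
    (mean national per capita income) times the equivalence coefficient."""
    coefficient = EQUIVALENCE[min(household_size, 7)]
    return mean_per_capita_income * coefficient


def is_poor(household_income, household_size, mean_per_capita_income):
    """Crisp ISPL classification: poor if income is equal to or lower
    than the size-adjusted threshold."""
    return household_income <= poverty_line(mean_per_capita_income,
                                            household_size)


if __name__ == "__main__":
    mean_income = 1000.0  # hypothetical mean national per capita income
    for income, size in [(950.0, 2), (1500.0, 4), (1700.0, 4)]:
        label = "poor" if is_poor(income, size, mean_income) else "non-poor"
        print(f"income {income:.0f}, {size} members: {label}")
```

The sketch makes the sharpness of the criterion visible: two households
whose incomes differ by a negligible amount on either side of the
threshold receive opposite labels.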

The choice of the most coherent approach to measuring poverty is
strongly related to the aim of the analysis; in the words of Abel
Smith (1984): “How should poverty be defined? This question cannot
be answered until we have decided for what purpose we want a
definition”. In fact, if poverty is a cause of discontent and
damages social cohesion, or is perceived as a difficulty in the
daily management of life, or is only the cause of a social
stratification according to poverty levels, we have to admit that
only a subjective measure of poverty can take account of these
aspects. In any case, in accordance with the recent literature and
in our own view, poverty is a multidimensional phenomenon, and we
find income (or expenditure) a biased indicator of poverty status,
even if we understand that in an official context it could be hard
to interpret more complex indicators (i.e. multidimensional or
subjective ones). Researchers have hardly agreed on the
operationalization of the concept of poverty. Sen himself pointed
out that a measurement can hardly be sharper than the concept it
expresses. Starting from these considerations, our study originates
from the need to:

a) overcome the unidimensionality of the domain (determinants) and
the dichotomy in the codomain (state space) of the poverty
function: the proposed indicator is in fact a synthesis of simpler
indicators, and it arranges the state of deprivation on a continuum
according to a fuzzy approach;

b) analyze poverty in its dynamic components: this study has been
carried out on a panel sample of households, for which we have
estimated the joint membership of the fuzzy state of poverty over
different periods.

Moreover, as far as our purposes are concerned, the longitudinal
approach takes advantage of the temporal invariance of part of the
household characteristics that could act as confounding variables.

3 Poverty Analysis by Crisp States Transition 
Matrices 

There are no surveys in Italy specifically designed to measure
poverty, but only some surveys, focused on households, that collect
information about general socio-economic characteristics. The
oldest is the Istat Annual Survey on Consumption (Istat (2000)): it
collects a detailed set of variables related to the level and
quality of household consumption and to the main socio-demographic
characteristics. The second is the biennial Bank of Italy Survey of
Household Income and Wealth (Banca d'Italia (1997)), which focuses
on the nature and the amount of household income and wealth. These
two surveys, which are the only ones with national relevance and
sufficient continuity over the years, focus on the economic
characteristics of households and leave out the social dimensions
necessary to complete the description of poverty status. This
constitutes a remarkable handicap for a correct and complete
approach to the measurement of poverty. Nevertheless, since we are
interested in studying the dynamics of poverty among households in
a national perspective, we decided to base our analyses on the
panel component of the Bank of Italy survey (the Istat survey does
not have a long enough panel component). Accordingly, we considered
only the households that participated in every survey from 1989 to
1995. It is obviously a self-selected sample, but it seems to
respect some criteria of representativeness (Mendola (1999)).
Following the traditional analyses of poverty based on the ISPL, we
computed the main classic indicators of poverty using income data
and the Italian official equivalence scale (Carbonaro (1985)). The
diffusion and intensity of poverty among panel households are
almost stable during the first two years we considered: we found a
headcount ratio of 6.4% and a poverty gap ratio (the percentage gap
of the poor from the poverty line) of about 20%. In the second
period we note a worsening: poverty incidence rises to about 11% in
1993, and in 1995 it decreases only slightly (9.7%). Furthermore,
poverty is increasingly severe: the poverty gap jumps to 28.7% in
1993 and reaches 27.3% in 1995 (1).
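
The two classic indicators used above can be sketched in Python as
follows. This is an illustration under stated assumptions, not the
authors' code: incomes are assumed to be already equivalised (divided by
the household's equivalence coefficient) so that a single poverty line
`z` applies to every household, and all figures in the example are
invented.

```python
def headcount_ratio(equivalent_incomes, z):
    """Share of households whose equivalent income does not exceed z."""
    poor = [y for y in equivalent_incomes if y <= z]
    return len(poor) / len(equivalent_incomes)


def poverty_gap_ratio(equivalent_incomes, z):
    """Average relative shortfall (z - y) / z, computed over the poor only.
    Returns 0.0 when no household is poor."""
    poor = [y for y in equivalent_incomes if y <= z]
    if not poor:
        return 0.0
    return sum((z - y) / z for y in poor) / len(poor)


if __name__ == "__main__":
    # HYPOTHETICAL equivalent incomes and poverty line.
    incomes = [500.0, 800.0, 1200.0, 1500.0, 2000.0]
    line = 1000.0
    print(f"headcount ratio: {100 * headcount_ratio(incomes, line):.1f}%")
    print(f"poverty gap ratio: {100 * poverty_gap_ratio(incomes, line):.1f}%")
```

The headcount ratio measures how widespread poverty is, while the gap
ratio measures how far below the line the poor fall on average, which is
why the text reads a rise in the latter as poverty becoming more severe.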

The analysis of crisp state transition matrices (which allows us
to overcome, at least partially, the arbitrariness induced by the
definition of the poverty line) shows that, comparing two adjacent
periods, the households that remain in the same state are about 90%
of the sample. Consequently transition flows (see Table 1) are
about 10%, which is to be considered quite large: most likely some
of them are spurious. Yet, even if the permanence and change
percentages seem quite stable when the observed lag increases, the
odds ratios (OR) clearly decrease. In fact, between '89 and '91 the
odds ratio of staying permanently poor or non-poor versus passing
through a transitory hardship is about 21; when the lag increases,
e.g. '89-'95, this advantage of permanence over change is, as
expected, substantially smaller (OR equal to 7.2). Moreover, the
longer the permanence in a state, the smaller the probability of
change.

Table 1. Permanencies and changes: transitions into and out of poverty

                       Permanence in    Change of    Odds ratio     Odds ratio confidence
Transition periods     the same state   state        (permanence    interval (alpha = 0.05)
                       (sample %)       (sample %)   vs. change)    lower       upper
I lag      1989-91     93.0             7.0          21.260         11.034      40.964
           1991-93     89.6             10.4         13.876          7.616      25.282
           1993-95     90.7             9.3          22.755         13.269      39.024
II lags    1989-93     89.4             10.6         12.640          6.953      22.978
           1991-95     90.2             9.8          12.839          7.009      23.522
III lags   1989-95     88.7             11.3          7.212          3.900      13.336

1 The "jump" between 1991 and 1993 is perfectly consistent with the economic
recession that Italy went through in 1992-1993. These are the years of the
depreciation of the national currency and of the exit from the EMS. In our
analyses too, 1993 is the worst year (in terms of economic hardship) for
Italian households.
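As a worked illustration of how an odds ratio and its confidence interval can be obtained from a crisp 2x2 transition table, the sketch below uses the standard log-odds (Woolf) interval. The paper does not state how the intervals in Table 1 were computed, so this is an assumption; the published intervals are symmetric on the log scale, which is at least compatible with a construction of this kind. The cell counts are hypothetical.

```python
import math


def odds_ratio_ci(n11, n10, n01, n00, z=1.96):
    """Odds ratio of permanence vs. change with a log-scale (Woolf) interval.

    n11: poor -> poor, n10: poor -> non-poor,
    n01: non-poor -> poor, n00: non-poor -> non-poor.
    z = 1.96 corresponds to alpha = 0.05.
    """
    or_hat = (n11 * n00) / (n10 * n01)
    se_log = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts for one pair of survey years (illustration only).
or_hat, lower, upper = odds_ratio_ci(n11=40, n10=30, n01=40, n00=890)
```

With an interval of this type the point estimate is the geometric mean of the two bounds, a property that can be checked directly on the rows of Table 1.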

4 A Fuzzy Multidimensional Poverty 
Indicator 

Building a multidimensional indicator of poverty is a highly
arbitrary process, in which the researcher has to make strong
assumptions, principally about which dimensions make up the
phenomenon, how they are measured, and which methodologies are
used to handle and aggregate the data. The choice of the relevant
dimensions of poverty cannot be argued on formal grounds: it is a
researcher's option that can be neither confuted nor falsified.
The aggregation process is instead, without doubt, one of the
statistically most relevant steps in arranging a synthetic measure
of poverty, and it should respect some basic properties of
statistical coherence. Once the elementary components of our
indicator have been selected, the main methodological choices can
be summarized in these steps:

1) the transformations to be applied to the data to obtain a unit
of measure common to heterogeneous components;

2) the choice of an aggregating function able to combine the
elementary indicators (e.g. by sum or product) in a way that is
coherent with the nature of the relationship between the overall
concept of poverty and its "components";

3) the selection of a sustainable weighting system according to
some definite criterion. This process is not free from a high
degree of arbitrariness.

Moreover, separating the poor from the non-poor by a
household-size adjusted poverty line (like ISPL), which is based
on a unidimensional economic criterion, leaves us unsatisfied
because it induces a remarkable number of spurious oscillations
around the poverty line, due to minimal income variations which do
not correspond to real changes in household welfare, especially
if we refer to changes in the current expenditure level. In line
with the multidimensional nature of poverty, we decided to
decompose it into four main dimensions, constrained by the
information available in the Bank of Italy's survey. The remarkable
symptoms of the presence of poverty that we chose as
representative of the dimensions of the poverty concept are
reported in Table 2 and refer to: financial hardship, housing
inadequacy and subjective perception of poverty. The subjective
dimension of poverty was measured according to the yes-no
question: "In 1995 did you or another member of your household
consider the possibility of applying to a bank or a financial
company for a loan or a mortgage but then change his/her mind
thinking that the application would be rejected?". This was, in
our opinion, the best indicator of self-perceived poverty
available from the Bank of Italy survey.

Finally, the food-ratio (the ratio between the expenditure for
foodstuffs and the total expenditure of a household), which
strictly expresses food needs but whose large interpretative
potential in poverty analysis is well known, was added as a fourth
dimension. The ten selected simple indicators could be combined in
many ways, in particular using multivariate statistical methods
such as Principal Component Analysis or Multidimensional Scaling,
but we chose to build a simple (linear) synthetic indicator while
exploring the interpretative potential of fuzzy set theory (Zadeh
(1965), Dubois and Prade (1980)) to express the simple indicators.
In fact we consider fuzzy theory particularly consistent with the
intrinsically continuous nature of poverty.

Table 2. The dimensions of the concept of poverty

Dimensions of poverty    Indicators                                       Measurement levels
financial hardship       - debts for the purchase of furniture and
                           household goods                                yes/no
                         - debts for the purchase of non-durable goods    yes/no
                         - debts with friends and relatives               yes/no
housing inadequacy       - surface of house in m^2                        7 ord. categ.
                         - property status                                3 ord. categ.
                         - kind of house                                  6 ord. categ.
                         - bathroom                                       yes/no
                         - house heating                                  yes/no
subjective perception    - self-perceived poverty                         yes/no
food needs               - food-ratio                                     (percentage)

Let P be the population we are referring to, and let h be a
household in P; let A be the fuzzy subset of the poor belonging to
P. A is defined by the couple A = {h, f_A(h)}, where f_A(·) is
called the membership function of households in the fuzzy set of
the poor A. f_A(·) is a function that maps the set P into the
closed real interval [0,1]. In particular:

    f_A(h) = 1 means that h belongs to A completely;
    f_A(h) = 0 means that h surely does not belong to A.

The values of the function such that 0 < f_A(h) < 1 express the
grades of membership of h in the fuzzy set A; the higher the value
of f_A for household h, the sharper its membership in the fuzzy
set of the poor. Among the membership functions available in the
literature, we selected the one proposed in Cheli and Lemmi
(1995). So we aggregated the indicators in a composite index known
as TFR (Totally Fuzzy and Relative) which, for household i, is:

    f(x_{i.}) = \frac{\sum_{j=1}^{k} g(x_{ij})\, w_j}{\sum_{j=1}^{k} w_j}, \qquad i = 1, 2, \ldots, n    (1)

where g(x_ij) is the grade of membership (gom) in the poverty fuzzy
set according to the j-th indicator X_j (symptom of poverty) for
household i, and the w_j are weights which decrease as the
diffusion of the symptom increases in the population. So, f(x_i.)
is a weighted mean of k goms in the fuzzy set of the poor.

In particular, according to the TFR method:

a) the g(x_ij) are based on the empirical distribution function
H(·) of every simple ex-post poverty indicator X_j:

    g(x_{ij}) =
    \begin{cases}
      0 & \text{if } x_{ij} = x_j^{(1)} \\
      g(x_j^{(k-1)}) + \dfrac{H(x_j^{(k)}) - H(x_j^{(k-1)})}{1 - H(x_j^{(1)})} & \text{if } x_{ij} = x_j^{(k)} \ (k > 1)
    \end{cases}    (2)

with x_j^{(1)}, ..., x_j^{(m)} representing the m categories of the
variable X_j arranged in increasing order with respect to the risk
of poverty, so that x_j^{(1)} denotes the minimum risk and
x_j^{(m)} the maximum risk;

b) weights are given by w_j = ln[1/g*(x_j)], where g*(x_j) is the
mean, over all households, of the goms based on the j-th indicator
and represents the fuzzy proportion of the poor with respect to
X_j (further methodological details in Cheli (1995)).
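The TFR construction in (1), a) and b) can be sketched in a few lines. This is only an illustration under stated assumptions: each indicator is coded with integers that increase with the risk of poverty, at least two categories are observed for every indicator, and the function names and the toy data are hypothetical rather than the authors' implementation.

```python
import math
from collections import Counter


def tfr_gom(values):
    """Grades of membership g(x_ij) for one ordered indicator.

    values: category codes, a higher code meaning a higher risk of poverty.
    The lowest observed category gets 0, the highest gets 1.
    """
    n = len(values)
    counts = Counter(values)
    cats = sorted(counts)
    H, cum = {}, 0
    for c in cats:                       # empirical distribution function H
        cum += counts[c]
        H[c] = cum / n
    g = {cats[0]: 0.0}
    denom = 1.0 - H[cats[0]]
    for prev, cur in zip(cats, cats[1:]):
        g[cur] = g[prev] + (H[cur] - H[prev]) / denom
    return [g[v] for v in values]


def tfr_index(indicators):
    """Weighted mean of goms per household, with w_j = ln(1 / mean gom)."""
    goms = [tfr_gom(col) for col in indicators]
    n = len(indicators[0])
    weights = [math.log(1.0 / (sum(g) / n)) for g in goms]
    total = sum(weights)
    f = [sum(g[i] * w for g, w in zip(goms, weights)) / total for i in range(n)]
    return f, weights


# Toy data for five households (hypothetical): 1 = no heating; tenure coded 0-2.
no_heating = [0, 0, 0, 0, 1]
tenure = [0, 1, 1, 2, 2]
f, w = tfr_index([no_heating, tenure])
```

The rarer symptom (lack of heating, shown by one household in five) receives the larger weight, which is the behaviour the logarithmic weighting is meant to produce.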



For the sake of brevity, we report here only the results
referring to the last two years considered, i.e. 1993 and 1995
(for more details, see Mendola (1999)). The simple indicators with
the largest weights in equation (1) are, within housing
inadequacy, lack of house heating (w_j = 1.79) and the property
status of the house (owned, rented, usufruct) (w_j = 1.33); within
economic hardship, the need of a loan to purchase non-durable
goods (likely for current expenditures) (w_j = 4.52) and to buy
furniture and objects for the home (w_j = 3.50). Finally, the
contributions of the food-ratio and of the subjective dimension
are appreciable (w_j = 3.54). The synthetic indicators for the
whole sample, obtained as a simple mean of (1), are equal to
P93 = 0.0885 and P95 = 0.0832 in the two years. Note that, to
allow comparisons, we use the same weight vector for the two
years, i.e. the 1993 one; in any case the indicator P95 computed
with cross-sectional weights is very similar and equals 0.0897.
If we consider the TFR as a generalization of the headcount
ratio, we can maintain that the fuzzy proportion of poor
households in 1993 equals 8.85% (instead of the 11% obtained with
the traditional method), while in 1995 we have a fuzzy proportion
of 8.2% (instead of 9.7%). The likely reason for this pattern is
that the use of many indicators (of a multidimensional criterion)
makes the poverty indicator more robust to income fluctuations
and prevents part of the spurious transitions mentioned above.

5 Transitions into and out of the Fuzzy Set of 
the Poor 

For household i, let g_{ik}^{(t)} be the grade of membership in
state k at time t, and let g_{ikv}^{(t_1,t_2)} be its grade of
joint membership in state k at time t_1 and simultaneously in state
v at time t_2, which is here defined, according to the minimum
entropy criterion (Manton et al. (1992)), as:

    g_{ikv}^{(t_1,t_2)} = \min\left[ g_{ik}^{(t_1)},\, g_{iv}^{(t_2)} \right] \quad \text{constrained to} \quad \sum_{k} \sum_{v} g_{ikv}^{(t_1,t_2)} = 1    (3)

Table 3(a) shows the matrix of mean joint membership of
households in the two fuzzy sets of the poor (coded 1) and the
non-poor (coded 0) at time t_1 = 1993 and t_2 = 1995.

Table 3. Mean joint membership and fuzzy transition matrices

(a): Mean joint membership matrix [E(g_{ikv}^{(t_1,t_2)})]

                       1995
1993             poor (1)    non poor (0)    total
poor (1)         0.06        0.03            0.09
non poor (0)     0.02        0.89            0.91
total            0.08        0.92

(b): Fuzzy transition matrix [t_kv]

                       1995
1993             poor (1)    non poor (0)    total
poor (1)         0.704       0.296           1
non poor (0)     0.023       0.977           1

Analysis of the mean joint membership matrix allows us to
distinguish permanent and transitory poverty. In fact
E[g_{i11}^{(t_1,t_2)}] is the fuzzy proportion of households that
are poor in t_1 and in t_2, so we can take this quantity as a
permanent poverty index; while E[g_{i10}^{(t_1,t_2)}] +
E[g_{i01}^{(t_1,t_2)}] is the fuzzy proportion of households that
are poor in only one of the two years, and hence go through only
transitory poverty. So for the sample households we have an index
of permanent poverty equal to 0.06 (the fuzzy proportion of the
poor in both years), while the fuzzy proportion of households with
only one year of poverty is 0.05 (transitory poverty index);
households experiencing at least one year of deprivation represent
a fuzzy proportion of the sample equal to 0.11.

From the mean joint membership matrix, Tab. 3(a), we can obtain
the fuzzy transition matrix, Tab. 3(b), whose generic element,
t_kv = E[g_{ikv}^{(t_1,t_2)}] / E[g_{ik}^{(t_1)}], expresses
households' propensity to change state. As we can see, the
"probability" (or more correctly the propensity) to stay poor
(t_11 = 0.704) is rather high but considerably lower than that to
stay non-poor (t_00 = 0.977). Moreover, Table 3(b) points out a
consistent upward mobility, as shown by the value t_10 = 0.296
(propensity to move from poverty to non-poverty), noticeably
higher than the propensity to move in the opposite direction
(t_01 = 0.023).
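The step from the mean joint membership matrix to the permanent and transitory indices and to the transition propensities can be sketched as follows. The input values are the two-decimal entries of Table 3(a); because of that rounding, the row-normalised propensities only approximate the values published in Table 3(b), which were presumably computed from unrounded means. Variable names are hypothetical.

```python
# Mean joint memberships from Table 3(a):
# rows = state in 1993 (poor, non-poor), columns = state in 1995.
joint = [[0.06, 0.03],
         [0.02, 0.89]]


def transition_matrix(joint):
    """t_kv = E[g_kv] / E[g_k at t1]; each row total is the t1 marginal."""
    return [[cell / sum(row) for cell in row] for row in joint]


permanent = joint[0][0]                  # poor in both years
transitory = joint[0][1] + joint[1][0]   # poor in exactly one year
ever_poor = permanent + transitory       # at least one year of deprivation
T = transition_matrix(joint)
```

Each row of T sums to one by construction, so its entries read as propensities conditional on the 1993 state.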
Abstract. When an investigator performs a sequential consulting of ‘experts’ to
reduce his/her own uncertainty about an unknown quantity, efficiency reasons sug- 
gest he/she should use some criteria for selecting the ‘best’ expert to be consulted at 
each stage and for deciding at which stage the information acquiring can be stopped. 
In this paper some selecting and stopping rules are proposed, which are founded on 
some synthetic measures of the (expected) additional informative value of a not-yet- 
consulted expert and of the informative value of the already-acquired knowledge. A 
case-study, by exemplifying how the algorithms work, shows the soundness of the 
proposed informativeness criteria. 



1 Introduction 

In condition of uncertainty, an investigator can acquire informa- 
tion regarding an unknown quantity — the probability of an 
event, a risk, a future observation, a physical quantity — from 
heterogeneous sources of knowledge (in the following, named ‘ex- 
perts’): for example, research centers, information systems, theo- 
retical or empirical models, professional opinions. 

In such a context, the investigator has to face two fundamental 
questions: 

• how to synthesize knowledge. Numerous algorithms have been
proposed, which can be divided into two main classes (for a
critical review, see Genest and Zidek (1986); Cooke (1991)):
axiomatic procedures, which resolve the problem in terms of
weighted averages (Laplace (1812)), and Bayesian combinations,
where, by treating each expert's judgment as experimental data
to be used in updating the investigator's prior distribution,
the 'synthesis' is represented by a posterior distribution on
the unknown quantity (Morris (1977); Jouini and Clemen (1996)).
The most appropriate procedure is to be chosen each time by
considering various factors such as the context of the problem
and the sort, quality and amount of data available for
estimating the relevant parameters;

• how to choose among options. Often the investigator prefers to
consult the experts in successive stages rather than
simultaneously. In this way he/she avoids wasting time (and
money) by consulting too large a sample of experts: at each
stage, depending on the amount of information reached, he/she
can choose whether to stop or to continue the process and,
depending on the answers obtained from the experts already
contacted, he/she can select the 'best' expert to be consulted
at the subsequent stage.

The aim of this work is to propose some selecting and stopping 
rules which can be suitable to be used in a sequential consulting 
process. The substance of such rules is almost independent of the 
procedure chosen for combining information from the experts; not 
so their mathematical form. The reference, in the present work, 
is the Bayesian aggregation model suggested by Morris (1977), 
reviewed in a recursive form: although it cannot be regarded as 
‘the best’ combining algorithm, it’s undoubted that the Bayesian 
paradigm offers a logically coherent answer to the expert use prob- 
lem. 

The paper is organized as follows. Section 2, in writing Morris’ 
aggregation algorithm in a recursive form, gives the notation for 
the successive sections. In Section 3, some stopping and selecting 
criteria are suggested. The results from an application performed 
on real data, together with some concluding remarks, are pre- 
sented in Section 4. 

2 A Recursive Algorithm for a Sequential 
Belief Revision 

An investigator A, who is uncertain about the value of a random
quantity θ ∈ Θ ⊂ ℝ, expresses his initial personal state of
information in a prior probability distribution h_0(θ). To reduce
his subjective uncertainty, A performs a sequential consulting of
(at most n) experts Q_j: at each stage k (k = 1, 2, ..., K; K ≤ n),
the selected expert Q*_{j;k} (or, more briefly, Q_k) answers by
giving a personal density g_k(θ). Treating each expert's density as
the result of an experiment, A can revise his own beliefs via
Bayes' theorem. Assuming that (Morris (1977)):

a) each g_k(·) is parameterized with a location parameter m_k and
a shape parameter v_k;

b) for each k, the probability which A assigns to the event
v^(k) = ∩_{i=1}^{k} v_i (that is, the event "the shape parameter
values the experts will give are [v_1, ..., v_k]' = v") does not
depend on θ: in symbols, ℓ(v^(k) | θ) = ℓ(v^(k));

Morris shows that the posterior density can be written as
(footnote 1),

    h(\theta \mid m^{(k)}, v^{(k)}) = \frac{\ell(m^{(k)} \mid v^{(k)}, \theta)\, h_0(\theta)}{\int_{\Theta} \ell(m^{(k)} \mid v^{(k)}, \theta)\, h_0(\theta)\, d\theta}    (1)



where:

- ℓ(m^(k) | v^(k), θ), denoted in the following by ℓ_k(θ) for
notational convenience, indicates the subjective conditioned
likelihood function of θ for the data m^(k) = ∩_{i=1}^{k} m_i
given v^(k): it represents, for θ varying, A's probabilities that
the location parameter values the experts will provide are
m = [m_i]'_{i=1,...,k};

- the posterior h(θ | m^(k), v^(k)) or, more briefly, h_k(θ),
represents A's (subjective) synthesis distribution at stage k.

If the following assumption holds too,

c) for each k, the conditional probability which A assigns to the
event "the shape parameter of g_k(·) will be v_k", given m^(k-1),
v^(k-1) and θ, does not depend on θ: that is,
ℓ(v_k | m^(k-1), v^(k-1), θ) = ℓ(v_k | m^(k-1), v^(k-1)),

then Morris' (simultaneous) aggregation algorithm (1) can be
written in a recursive form as,

    h_k(\theta) = \frac{\ell(m_k \mid v_k, m^{(k-1)}, v^{(k-1)}, \theta)\, h_{k-1}(\theta)}{\int_{\Theta} \ell(m_k \mid v_k, m^{(k-1)}, v^{(k-1)}, \theta)\, h_{k-1}(\theta)\, d\theta}    (2)

where ℓ(m_k | v_k, m^(k-1), v^(k-1), θ) is the subjective
conditioned likelihood function of θ for the single observation
m_k, given v_k and also the location and shape values provided by
the k − 1 previously consulted experts.

1 It can be shown that these assumptions can be relaxed without
substantially changing the results (Morris (1977)).

As regards the arduous assessment of the function ℓ(·) in (2),
the relation ℓ(m_k | v_k, m^(k-1), v^(k-1), θ) = ℓ_k(θ) / ℓ_{k-1}(θ)
allows the use of Morris' (simultaneous) result,

    \ell_k(\theta) \propto C_k(\theta) \cdot \prod_{i=1}^{k} g_i(\theta)    (3)

where the subjective calibration function C_k(θ) encapsulates the
investigator's state of knowledge about each expert's probability
assessment ability and the degree of dependence among the k
experts.
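The equivalence between the simultaneous form (1) and the recursive form (2) can be checked numerically on a grid. The sketch below makes strong simplifying assumptions that are not in the paper: Gaussian expert densities, a Gaussian prior, and a constant calibration function C_k(θ), so that by (3) the stage-k likelihood increment ℓ_k(θ)/ℓ_{k-1}(θ) reduces to g_k(θ). All numerical values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normalize(density, step):
    """Rescale grid values so that they integrate to one (Riemann sum)."""
    total = sum(density) * step
    return [d / total for d in density]


step = 0.01
grid = [-5.0 + i * step for i in range(801)]                       # theta in [-5, 3]
prior = normalize([normal_pdf(t, 0.0, 4.0) for t in grid], step)   # hypothetical h_0
experts = [(-1.2, 0.11), (-2.0, 0.15), (-2.7, 0.12)]               # hypothetical (m_i, v_i)

# Recursive form (2): with constant C_k the stage-k factor l_k / l_{k-1} is g_k.
h = prior
for m, v in experts:
    h = normalize([normal_pdf(t, m, v) * hk for t, hk in zip(grid, h)], step)


def joint_likelihood(t):
    """Simultaneous likelihood: product of all expert densities at theta = t."""
    value = 1.0
    for m, v in experts:
        value *= normal_pdf(t, m, v)
    return value


# Simultaneous form (1): a single update of the prior with the joint likelihood.
batch = normalize([joint_likelihood(t) * p for t, p in zip(grid, prior)], step)
max_diff = max(abs(a - b) for a, b in zip(h, batch))
```

The two posteriors agree up to floating-point error, which is the point of writing the algorithm recursively: each new answer updates the previous synthesis without reprocessing the earlier ones.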

In short (for details, see Morris (1977)), let τ_i denote the
i-th performance indicator, defined as Q_i's cumulative function
G_i(· | m_i, v_i) evaluated at the true value of θ: C_k(θ)
expresses the admissibility degrees which A assigns to each
possible θ value looked at as the realization of the
k-dimensional quantile vector τ = [τ_i]'_{i=1,...,k}. Technically,
C_k(·) is nothing but a subjectively assessed density φ_k(·) of τ,
conditioned on v and θ, looked at as a function of θ (for fixed
m): in symbols, the relation between the so-called performance
function φ_k(·) and the calibration function C_k(θ) is,

    \phi_k(\tau \mid v, \theta) = \phi_k[G(\theta \mid m, v) \mid v, \theta] = C_k(\theta)    (4)

where G(θ) = G(θ | m, v) denotes the vector
[G_i(θ | m_i, v_i)]'_{i=1,...,k}.

Whenever the investigator has only some pieces of information
about the experts (an 'information block' which is not adequate
to construct an empirically founded probability distribution of
their performance indicators), the fiducial argument (Fisher,
1956) can be used for inductively modelling the calibration
function, enabling it to be specified with a relatively small
number of assessments and warranting A's beliefs about the
experts in terms of personal long run frequencies (Monari and
Agati (2001)).

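With the following notation:

- \tilde{G}(θ) = [\tilde{G}_i(θ)]'_{i=1,...,k}, with
  \tilde{G}_i(θ) = ln[G_i(θ) / (1 − G_i(θ))];
- \tilde{t} = [\tilde{t}_i]'_{i=1,...,k}, with
  \tilde{t}_i = ln[t_i / (1 − t_i)];
- c as normalization constant;

the resulting fiducial calibration function can be written as,

    C_k(\theta) = C_k(\theta; t, S) = c \cdot \prod_{i=1}^{k} \{ G_i(\theta) [1 - G_i(\theta)] \}^{-1} \cdot \exp\left\{ -\tfrac{1}{2} [\tilde{G}(\theta) - \tilde{t}]'\, S^{-1}\, [\tilde{G}(\theta) - \tilde{t}] \right\}    (5)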

It is worth noting that the function (5) is uniquely determined
by:

- A's assessment t = [t_i]'_{i=1,...,k} of the performance
indicator τ;

- the subjective variance-covariance matrix S, reflecting A's
information about the variability and the reciprocal dependence
of the experts' performance indicators.
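A numerical reading of the fiducial calibration function (5), up to the constant c, is sketched below. It assumes Gaussian expert densities, so that each G_i is a normal cumulative function, and it takes the inverse of S as an input instead of inverting the matrix; both choices, like the function names, are assumptions made for illustration only.

```python
import math


def normal_cdf(x, mean, var):
    """Gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def logit(p):
    return math.log(p / (1.0 - p))


def fiducial_calibration(theta, experts, t, S_inv):
    """Unnormalized C_k(theta) as in (5).

    experts: list of (m_i, v_i) pairs (location and variance);
    t: assessed performance indicators t_i, each in (0, 1);
    S_inv: inverse of the variance-covariance matrix S, as nested lists.
    Only meaningful where every G_i(theta) is strictly between 0 and 1.
    """
    G = [normal_cdf(theta, m, v) for m, v in experts]
    d = [logit(Gi) - logit(ti) for Gi, ti in zip(G, t)]
    k = len(d)
    quad = sum(d[i] * S_inv[i][j] * d[j] for i in range(k) for j in range(k))
    jacobian = 1.0
    for Gi in G:
        jacobian /= Gi * (1.0 - Gi)
    return jacobian * math.exp(-0.5 * quad)


# Single expert, with the values reported for Q4 in the case study of Section 4
# (v = 0.110, t = 0.45, s = 1.10, observed location -1.208).
value = fiducial_calibration(-1.208, [(-1.208, 0.110)], [0.45], [[1.0 / 1.10]])
```

At θ equal to the expert's location, G is 0.5 and the first factor equals 4, so the value depends only on how far the assessed t lies from 0.5 on the logit scale.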



3 Selecting and Stopping Rules 

The purpose of expert consulting is to reduce the investigator's
uncertainty about the unknown quantity θ. So, in designing and
performing the sequential process, it is reasonable to found the
selecting and stopping rules on some criterion of informativeness.

In particular, though no single number can convey the amount
of information encapsulated in a density function, a synthetic
measure of the (expected) additional informative value of a
not-yet-consulted expert Q_{j;k} is indispensable for selecting
the one to be consulted at stage k, especially when A's
calibration assessments, together with the shape parameters
provided by the experts, lead to non-coinciding preference
orderings. Analogously, as likelihood functions and posterior
densities can display a wide variety of forms, a synthetic measure
of the degree of knowledge reached about θ is needed for picking
out the 'optimal' stage k* at which data acquisition can be
stopped.

Suppose the investigator A is performing the process of revising
beliefs in light of new data according to the algorithm described
in Section 2. He has specified his prior h_0(θ); each of the n
contacted experts Q_j has revealed the variance v_j of his own
density g_j(θ) (assumed to be uninformative about θ: see b) in
Section 2), and A has already consulted k − 1 of them, so
obtaining the locations of k − 1 expert densities: he is now at
stage k of the process (Figure 1), and must select one among the
not-yet-consulted experts Q_{j;k} (j = 1, 2, ..., n − k + 1).




Fig. 1. Flow-chart of the sequential procedure.

For each Q_{j;k}, the investigator A assesses (conditionally on
v_j, on the basis of the information at his disposal, including
all the expert locations m_i revealed up to stage k − 1) the
parameters of the k-stage calibration function C_{j;k}(θ): that
is, t_j, s_jj and the covariances s_ji (or the linear correlations
r_ji) between Q_{j;k} and each already-consulted expert Q_i,
i = 1, 2, ..., k − 1.

At this point of the procedure, no Q_{j;k} has revealed the
location value m_j of his own g_j(θ): the several 'answers' m_j
which each can virtually give are not all equally informative, so
the (informative) value of each expert at stage k, to be measured
with regard to A's current knowledge (footnote 2) of θ reflected
in the posterior density h_{k-1}(θ) of the previous stage, is an
expected value, calculated by averaging a selected measure of
relevant information about θ in Q_{j;k}'s answer over the space
M_j of the virtually possible m_j values.

Reasoning in a knowledge context, that is an inductive context
where an expert opinion is the more relevant the more it is able
to modify the investigator's posterior distribution on the
unknown quantity, a suitable measure of the informative value of
Q_{j;k} can be the expected Kullback-Leibler divergence of the
density h_{j;k}(θ) with respect to the previous-stage posterior
h_{k-1}(θ),

    E[KL(h_{j;k}, h_{k-1})] := \int_{M_j} f(m_{j;k} \mid v_{j;k}, m^{(k-1)}, v^{(k-1)}) \cdot KL(h_{j;k}, h_{k-1})\, dm_j    (6)

where the KL-divergence (Kullback (1959)),

    KL(h_{j;k}, h_{k-1}) := \int_{\Theta} h_{j;k}(\theta) \cdot \ln[ h_{j;k}(\theta) / h_{k-1}(\theta) ]\, d\theta    (7)

measures indirectly the information provided by an answer m_{j;k}
in terms of the changes it yields on the density h_{k-1}(θ). The
conditional density f(·) in (6) is equal to the denominator of (2)
read as a function of m_{j;k} and normalized; when assumptions a),
b) and c) hold, it can be determined as

    f(m_{j;k} \mid v_{j;k}, m^{(k-1)}, v^{(k-1)}) = \frac{f(m^{(j;k)} \mid v^{(j;k)})}{f(m^{(k-1)} \mid v^{(k-1)})}    (8)

where the density f(m^(j;k) | v^(j;k)), and analogously
f(m^(k-1) | v^(k-1)), is equal, up to the normalization term, to
the denominator ∫_Θ ℓ(m^(k) | v^(k), θ) · h_0(θ) dθ of (1), read
as a function of m^(k).

2 In fact, all the other elements being equal, the more A is
uncertain about θ, the more an answer m_j is worth.
The expert Q*_{j;k} presenting the greatest expected KL
divergence is, at stage k, the most informative: but is he an
expert worth consulting? The answer is yes, if the information he
provides is, on average, sufficiently different from what A
already knows about θ, i.e. if the expected divergence of
h_{j*;k}(θ) with respect to h_{k-1}(θ) is not less than a
predetermined value δ (0 < δ < ∞). As for the choice of the
threshold δ, a very useful tool is the 'calibration scheme'
proposed by McCulloch for deciding whether a KL-divergence value
is a large or a small one (see McCulloch (1989)).

So the selecting rule can be expressed as follows. Consult the
expert Q*_{j;k} such that

    E[KL(h_{j^*;k}, h_{k-1})] > E[KL(h_{j;k}, h_{k-1})], \qquad j \neq j^*    (9)

on condition that

    E[KL(h_{j^*;k}, h_{k-1})] \geq \delta    (10)

If Q*_{j;k} does not satisfy (10), then proceed to a second-order
analysis: that is, consult the pair (Q_{j;k}, Q_{u;k})* presenting
the greatest expected KL-divergence, provided that
E[KL(h_{(j,u)*;k}, h_{k-1})] ≥ δ; otherwise contact a new set of
experts and perform a new process by using the posterior
h_{k-1}(θ) as a new prior h_0(θ).
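The first-order part of the selecting rule, (9) together with (10), amounts to an arg-max followed by a threshold check. A minimal sketch follows; the function name is hypothetical, and the expected divergences used in the example are the stage-1 values reported in the case study of Section 4, with δ = 0.02.

```python
def select_expert(expected_kl, delta):
    """Return the expert with the greatest expected KL-divergence,
    or None when even that expert falls below the threshold delta."""
    best = max(expected_kl, key=expected_kl.get)
    return best if expected_kl[best] >= delta else None


# Expected divergences at stage k = 1 as reported in the case study.
stage1 = {"Q1": 1.41487, "Q2": 1.35582, "Q3": 1.42427, "Q4": 1.60348}
choice = select_expert(stage1, delta=0.02)
```

When the function returns None, the procedure described above moves on to pairs of experts or to a new set of experts; that branch is not covered by this sketch.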

The expert Q*_{j;k} satisfying (10) becomes just Q_k, the
"k-stage expert". By consulting him, A learns the location m_k of
the density g_k(·): now the k-stage calibration function C_k(θ)
is uniquely determined and, consequently, so are the likelihood
function ℓ_k(θ) and the posterior density h_k(θ).

In theory, the investigator should stop the consulting only
when the aggregated knowledge about θ, reflected in the posterior
density, is 'inertially stable': i.e., only when additional
experts, even if jointly considered, are not able to modify the
synthesis distribution appreciably and, on the contrary,
contribute to its inertness. But too many experts could be needed
for realizing such a stopping condition. It can be weakened by
requiring just that the knowledge about θ deriving from the expert
answers be enough for A's purposes. A measure encapsulating the
strength of the experimental data in determining a preference
ordering among 'infinitesimally close' values of θ is Fisher's
notion of information. The value of the observed information I(·)
at the maximum of the log-likelihood function,

    I_k(\theta_{\max}) := -\left. \frac{d^2}{d\theta^2} \ln \ell_k(\theta) \right|_{\theta = \theta_{\max}}    (11)

is a second-order estimate of the spherical curvature of the
function at its maximum: within a second-order approximation, it
corresponds to the KL-divergence between two distributions that
belong to the same parametric family and differ infinitesimally
over the parameter space.

So, the stopping rule may be defined as follows. Stop the
consulting at the stage k* at which a pre-selected observed
curvature λ of the log-likelihood (footnote 3), evaluated at
θ = θ_max, has been reached,

    I_{k^*}(\theta_{\max}) \geq \lambda    (12)

To decide whether a curvature value I(θ_max) = w is a large or a
small one, a 'calibration' can be performed by thinking of a
binomial experiment where a number x = n/2 of successes is
observed in n trials, and finding x such that I(p_ML = 0.5) = w,
where p_ML = 0.5 is the maximum likelihood estimate of the
binomial parameter p. Table 1 shows a range of x values with the
corresponding curvature values w. The simple relation x = w/8
holds: so, for example, if w = 120, the width of the curve
ln ℓ(θ) near θ = θ_max is the same as that of the curve ln ℓ(p)
at p_ML = 0.5 when x = 15 and n = 30.

3 The curvature of the log-posterior can also be used, so that the
whole knowledge, including the prior one, is considered.

Table 1. 'Calibration' of curvature values: relation between x and w values.

x      1     2     5    10    15    20    25    30    40    50
w      8    16    40    80   120   160   200   240   320   400
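The relation x = w/8 behind the calibration table follows from the observed information of a binomial log-likelihood, x/p^2 + (n − x)/(1 − p)^2, evaluated at p = x/n = 0.5. A minimal sketch, with hypothetical function names, that reproduces the table and applies the stopping rule (12) to the curvature values reported in the case study:

```python
def binomial_observed_information(x, n, p):
    """Minus the second derivative of the binomial log-likelihood at p."""
    return x / p ** 2 + (n - x) / (1.0 - p) ** 2


def calibrate_curvature(w):
    """Successes x, out of n = 2x trials, giving the same curvature w."""
    return w / 8.0


def should_stop(curvature, lam):
    """Stopping rule (12): stop once the observed curvature reaches lam."""
    return curvature >= lam


xs = [1, 2, 5, 10, 15, 20, 25, 30, 40, 50]
ws = [binomial_observed_information(x, 2 * x, 0.5) for x in xs]

# Curvatures reported at stages 1 to 3 of the case study, with lambda = 120.
decisions = [should_stop(c, 120) for c in (18.713, 53.492, 138.984)]
```

Reading a curvature through the equivalent binomial experiment gives the threshold a concrete meaning: λ = 120 asks for as much information as 15 successes in 30 trials would give about a proportion.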



4 Case-study and Concluding Remarks 

The behavior of the algorithms proposed in the previous section,
implemented in MATHEMATICA (Agati and Stracqualursi (2001)), has
been investigated in simulation and experimental studies.

In this section, the results from medical data are briefly
(footnote 4) presented to exemplify how the selecting and stopping
rules work. In particular, the data in Table 2 regard a sequential
consulting process of n = 4 orthopaedists, performed by an Italian
research laboratory, about the long-term failure log-odds θ of a
new hip prosthesis. A fifth surgeon, who plays the role of
investigator, has assessed the calibration parameters, without
modifying them in proceeding from one stage to the next. He has
also (subjectively) chosen the following thresholds:

- δ = 0.02: reading this value on McCulloch's scale, at stage k
the most informative expert Q*_{j;k} is consulted only if the
expected KL-divergence of h_{j*;k}(θ) with respect to h_{k-1}(θ)
is not less than the KL-divergence of a Bernoulli distribution
B(p) with p = 0.5 from a Bernoulli distribution with p = 0.65; or,
in other words, only if stopping the process at stage k − 1
instead of proceeding to stage k involves, on average, an
information loss larger than the one yielded by using a B(0.65)
instead of a B(0.5);

- λ = 120: using the scale proposed in Section 3, the consulting
process is stopped at the stage k* at which the observed curvature
of the log-likelihood function ln ℓ(θ), evaluated at θ = θ_max, is
the same as that of the function ln ℓ(p) at p_ML = 0.5 when, in a
binomial experiment, n = 30 and x = 15.

4 A more detailed discussion about relevant methodological and technical issues
involved in applications concerning data of this kind is the subject of a paper in
preparation.

Table 2. Input data for the sequential consulting of four orthopaedists about
long-term failure log-odds of a new hip prosthesis.

Q_j     v_j      t_j     s_jj    r_j1     r_j2     r_j3     r_j4
Q_1     0.150    0.45    1.20    1
Q_2     0.145    0.65    1.50    +0.20    1
Q_3     0.120    0.75    1.70    -0.05    +0.50    1
Q_4     0.110    0.45    1.10    +0.10    +0.10    +0.10    1



In this study, the conditions a), b) and c) mentioned in Section
2 can be held to be satisfied. In fact: a) it rests on empirical
evidence, and the experts confirm it, that the failure log-odds
θ can be supposed to be Gaussian; b) it is reasonable to think
that the probability the fifth orthopaedist assigns to the event
"the experts will give the variances [v_1, ..., v_4]' = v" is the
same for all θ values: so the surgeons' stated variances alone
give no information able to change the investigator's beliefs
about θ; c) it is reasonable as well to assume that the
conditional probability the investigator assigns to the event
"the expert Q_{j;k} will give the variance v_k", given the shape
and location values provided by the k − 1 previously consulted
experts, is the same for all θ values. So the combining algorithm
outlined in Section 2 has been applied, as well as the selecting
and stopping rules suggested in Section 3.



Table 3. Output of the proposed sequential procedure in the consulting of four 
orthopaedists about long-term failure log-odds of a new hip prosthesis. 



Qj 


Stage k = 1 
E [KL (h ja ,h 0 )] 


Stage k = 2 

E[/tL(/p; 2 ,/ii)] 


Stage k = 3 
E [KL (h j;3 ,h 2 )} 


Q 1 


1.41487 


1.92935 


— 


Q 2 


1.35582 


1.52293 


1.42842 


Qs 


1.42427 


1.72981 


1.93624 


Qa 


1.60348 


— 


— 




1 


t 


t 


Qj;k 


Qa 




q 3 


m k 


-1.208 


-1.992 


-2.752 


Ik ($max) 


18.713 


53.492 


138.984 (> 120 = A) 
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Table 3 summarizes the results of the sequential process, while 
Figure 2 shows the posterior distributions hk (0) at each stage. 

For $k = 1$, the selecting rule proposed in Section 2 chooses the expert $Q_4$: indeed, he offers the smallest variance ($v_4 = 0.110$), and the investigator's uncertainty about his performance indicator is also assessed to be the smallest ($s_{44} = 1.10$). $Q_4$'s answer ($m_{4;1} = -1.208$) leads to a curvature value $I_1(\theta_{\max}) = 18.713 < 120 = A$, so the process goes on. At stage 2, the selecting rule shows its usefulness: in fact, the $v_j$, $t_j$ and $s_{jj}$ values$^5$ do not lead to a unique preference ordering. The most informative expert $Q_1$ is selected by the algorithm$^6$ and $m_{1;2}$ is observed. The curvature value is $I_2(\theta_{\max}) = 53.492 < 120 = A$: the consulting proceeds. At stage 3,




Fig. 2. Investigator’s distributions at stages 0 (i.e., the prior), 1, 2 and 3 of the 
sequential procedure. 



the preference for $Q_3$ over $Q_2$ is also (but not only) motivated by the correlations with $Q_1$: a negative correlation ($r_{31} = -0.05$) is more informative than a weak positive one ($r_{21} = 0.20$). The observed $m_{3;3}$ leads to $I_3(\theta_{\max}) = 138.984 > 120 = A$. The process is stopped: the expert $Q_2$ is left out of the consulting and the stage-3 posterior $h_3(\theta)$, whose location and shape values are, respectively, $-1.873$ (the median, here coinciding with the arithmetic mean and the mode) and $0.084$ (the standard deviation), can be regarded as a synthesis of the expert knowledge about the long-term failure log-odds $\theta$ of the new hip prostheses.

Looking at the selecting and stopping output of the case study, the behaviour of the informativeness criteria appears to be coherent with intuition, thus lending empirical support to the soundness of the proposed selecting and stopping algorithms in performing an efficient sequential consulting process.

5 The correlations between $Q_4$ and the other experts are all equal, so they do not come into play.

6 It is worth noting that the value $m_{4;1}$ observed at stage 1 has modified, at stage 2, the previous-stage preference ordering: for this reason, selecting only one expert at each stage is to be preferred to selecting a set of experts simultaneously.



Abstract. In this paper a finite mixture model with specific weights for each observation is introduced. The logistic transformation of these weights is modelled through a Markovian field, with spatial autocorrelation of Gaussian type. This specification is particularly useful for disease mapping issues: some implementation difficulties are briefly discussed, together with the problem of choosing the number of mixture components.



1 Introduction 

Disease mapping is the representation and analysis of maps of 
disease incidence or mortality data. This issue has been recently 
addressed in the framework of Bayesian hierarchical modelling, 
considering a Poisson model at the lowest level of the hierarchy 

$$[y_i \mid \theta_i] \sim \mathrm{Poisson}(\theta_i E_i) \qquad (1)$$

where $y_i$ is the count of new cases or deaths in a given period for a given pathology in the $i$th of $n$ areas, $E_i$ is the expected count we would observe under a standard rate, and $\theta_i$ is the unobserved (true) relative risk; spatial heterogeneity usually occurs when count data refer to small areas and/or rare diseases.

The common formulation of the higher level of the hierarchy 
(Besag et al. (1991), Breslow and Clayton (1993)) borrows its 
structure from the framework of the Generalized Linear Mixed 
Model (GLMM), being defined by 

$$\varphi_i = u_i + \varepsilon_i, \qquad (2)$$

where the log-relative risks $\varphi_i = \log \theta_i$ are modelled by means of two mutually independent random effects: the $\varepsilon_i$'s, which capture unstructured (extra-Poissonian) variation, being distributed as

$$[\varepsilon_i \mid \gamma_\varepsilon] \sim N(0, \gamma_\varepsilon^{-1}), \quad i = 1, \ldots, n \qquad (3)$$

and the $u_i$'s, which model the heterogeneity caused by unmeasured covariates (risk factors) inducing spatial correlation among counts of nearby areas. A popular joint specification relies on the Conditional Autoregressive (CAR) prior for $u = (u_1, \ldots, u_n)$ (Besag and Kooperberg, 1995), given by the following improper density

$$[u \mid \gamma_u] \propto (\gamma_u)^{n/2} \exp\left\{ -\frac{\gamma_u}{2} \sum_{i} \sum_{s \neq i} w_{is} (u_i - u_s)^2 \right\} \qquad (4)$$

where the weights $w_{is}$ are generally specified according to the adjacency model, with $w_{is} = 1$ if areas $i$ and $s$ share a common boundary and 0 otherwise. The Bayesian model specification is completed by assigning a suitable prior distribution to the parameters $\gamma_\varepsilon$ and $\gamma_u$: see, for example, Bilancia and Pollice (2000) for details.
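As an illustration of how the CAR prior penalizes differences between neighbouring areas, the following Python sketch evaluates its unnormalised log-density. It assumes the double sum in (4) runs over all ordered pairs of distinct areas; the three-area map and the function name are hypothetical.

```python
import math


def car_log_density(u, w, gamma_u):
    """Unnormalised log-density of the CAR prior:
    (n/2) log(gamma_u) - (gamma_u/2) * sum_i sum_{s != i} w[i][s] (u_i - u_s)^2.
    Assumption: the double sum counts every ordered pair of distinct areas."""
    n = len(u)
    quad = sum(w[i][s] * (u[i] - u[s]) ** 2
               for i in range(n) for s in range(n) if s != i)
    return 0.5 * n * math.log(gamma_u) - 0.5 * gamma_u * quad


# Hypothetical map of three areas in a row: 0-1 and 1-2 share a boundary.
w = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

flat = car_log_density([0.3, 0.3, 0.3], w, gamma_u=2.0)     # no spatial variation
rough = car_log_density([0.3, -0.5, 0.9], w, gamma_u=2.0)   # neighbours disagree
shifted = car_log_density([1.3, 0.5, 1.9], w, gamma_u=2.0)  # same surface plus 1
```

The density depends on `u` only through pairwise differences, so `rough` and `shifted` coincide: adding a constant to every area leaves the prior unchanged, which is why the density is improper.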

Many authors noticed that the above mentioned BYM model (Besag, York and Mollié) may induce oversmoothing when the underlying marginal distribution of the relative risk is discontinuous, including a finite number of well separated levels of risk. To overcome this issue, Knorr-Held and Raßer (2000) introduce a model providing a definite segmentation of the space of relative risks based on a modification of Voronoi tessellations; in a similar fashion, Green and Richardson (2000) describe the heterogeneity of observed counts by means of a finite mixture model with spatially autocorrelated allocation variables modelled by a Potts process. A companion paper of the latter is due to Fernandez and Green (2000), where alternative allocation mechanisms are explored.

All the models described in the aforementioned papers are estimated by Reversible Jump MCMC, and are thus hardly amenable to everyday use by researchers not experienced in the subject. In this paper we propose a simpler model based on the hierarchical specification described in Pollice and Bilancia (2000), which can be easily implemented using the multi-purpose package for Bayesian analysis WinBUGS 1.3 (Spiegelhalter et al. (2000)).

2 The Model

Following Pollice and Bilancia (2000), the possibly discontinuous risk structure ($k$ levels of risk) is taken into account by assuming that counts are conditionally independent among areas and distributed according to the following $k$-component mixture model having area-specific component weights

$$[y_i \mid \psi_{z_i}] \sim \mathrm{Poisson}(\psi_{z_i} E_i), \quad i = 1, \ldots, n \qquad (5)$$

$$\Pr(z_i = j \mid \pi) = \pi_{ij}, \quad j = 1, \ldots, k, \ \text{ind. over } i = 1, \ldots, n \qquad (6)$$

with the obvious restriction $\sum_{j=1}^{k} \pi_{ij} = 1$. The dependence of mixture weights on the explanatory variables is taken into account by the following multicategorical logit model considering the $k$th category as a reference category

$$\log(\pi_{ij}/\pi_{ik}) = \eta_{ij}, \quad i = 1, \ldots, n, \ j = 1, \ldots, k-1 \qquad (7)$$

Expression (7) can be reformulated as

$$\pi_{ij} = \frac{I_{[j=k]} + (1 - I_{[j=k]}) \exp(\eta_{ij})}{1 + \sum_{h=1}^{k-1} \exp(\eta_{ih})} \qquad (8)$$

where $I_{[j=k]}$ is the indicator function of the $k$th category (for which the linear predictor is identically null). Spatial dependence is achieved by modelling the weights $\pi_{ij}$ via the linear predictor $\eta_{ij}$, taking the following form

$$\eta_{ij} = \beta_0 + u_{ij}, \quad i = 1, \ldots, n, \ j = 1, \ldots, k-1 \qquad (9)$$

where $u_j = (u_{1j}, \ldots, u_{nj})$ is given a first order CAR prior (4)



In the above formula $\partial_i$ denotes the set of first order neighbours of area $i$. Model specification is completed by assigning a flat prior to the common intercept $\beta_0$, i.e.



$$[\beta_0] \propto 1 \qquad (11)$$

Note that the parameter $\beta_0$ has to be included in the model when using WinBUGS 1.3 for its estimation: the flat specification (11) is compulsory as well. When such a constant term is included in the linear predictor, an identifiability sum-to-zero constraint $\sum_{i=1}^{n} u_{ij} = 0$ must be introduced: WinBUGS 1.3 automatically imposes it “on the fly” by recentering the sampled spatial effects around their own mean at every MCMC iteration $g$.

Finally, a usual specification for the prior distribution of the smoothing parameter is given by

$$[\gamma_u] \sim \mathrm{Gamma}(a_u, b_u) \qquad (12)$$

Hyperparameters $a_u$ and $b_u$ are assumed to be known: their setting is notoriously difficult and some sensitivity analysis is generally required. A similar specification is used for the inverse variance $\gamma_\varepsilon$ of the unstructured random effects $\varepsilon_i$ in the BYM model.
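The mapping from linear predictors to area-specific mixture weights in expression (8) can be sketched in a few lines of Python. The intercept and spatial effects below are hypothetical numbers, and `mixture_weights` is an illustrative name rather than part of any package.

```python
import math


def mixture_weights(eta):
    """Map the k-1 linear predictors of one area to its k mixture weights,
    with the k-th category as reference (linear predictor equal to zero)."""
    denom = 1.0 + sum(math.exp(e) for e in eta)
    return [math.exp(e) / denom for e in eta] + [1.0 / denom]


# Hypothetical values for one area with k = 3: eta_ij = beta_0 + u_ij.
beta_0 = 0.2
u_i = [0.5, -1.0]
pi_i = mixture_weights([beta_0 + u for u in u_i])
```

By construction the weights sum to one, and the log-ratio of each weight to the reference weight returns the corresponding linear predictor, as required by (7).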

According to the Bayesian clustering framework, the $i$th area is assigned to the $j^*$th level of risk when the posterior distribution of the allocation variable $z_i$ reaches its mode at $j^*$, i.e. when

$$\arg\max_j \Pr(z_i = j \mid y) = j^* \qquad (13)$$

After classifying all areas on the basis of an MCMC output-based approximation of (13), maps can be drawn by coloring individual areas accordingly: the resulting choropleth map is usually very informative for public health assessment and monitoring.
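A minimal Python sketch of the classification rule (13) follows, assuming the sampled allocation variables are stored as one list per retained MCMC iteration; the draws shown are made up for illustration.

```python
from collections import Counter


def classify_areas(z_draws):
    """z_draws[g][i] is the allocation of area i at MCMC iteration g.
    Returns, for each area, the modal component and its estimated posterior
    allocation probability (relative frequency over the iterations)."""
    n_iter = len(z_draws)
    n_areas = len(z_draws[0])
    result = []
    for i in range(n_areas):
        counts = Counter(draw[i] for draw in z_draws)
        j_star, freq = counts.most_common(1)[0]
        result.append((j_star, freq / n_iter))
    return result


# Hypothetical output: 5 retained iterations, 3 areas, k = 2 components.
z_draws = [
    [1, 2, 2],
    [1, 2, 1],
    [1, 2, 2],
    [2, 2, 2],
    [1, 2, 2],
]
labels = classify_areas(z_draws)
```

The second element of each pair gives a rough indication of how firmly an area is allocated, which can be useful when shading the map.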

Other meaningful posterior summaries that can be easily estimated from the MCMC output are the posterior mean of the $j$th level of risk, $E(\psi_j \mid y)$, and the posterior expected relative risk of area $i$, $E(\psi_{z_i} \mid y)$ (notice that the proposed hierarchical specification does not contain area-specific relative risks in the likelihood).

A suitable choice of the prior distribution of relative risks is

$$[\psi_j] \sim \mathrm{Gamma}(\alpha, \beta) \qquad (14)$$

The choice of hyperparameters in (14) is not an intuitive task and deserves some attention: a central difficulty with Poissonian mixtures stems from the fact that it is not possible to be informative on the prior distribution of $\psi_j$, for the obvious reason that $\psi_j$ is both the mean and the variance of the distribution of the counts. Several authors discuss weakly informative settings: among these, Dellaportas et al. (1997) choose $\alpha = \beta = 0.01$. However, the latter prior is sharply peaked in a neighborhood of zero, favoring the inclusion of hardly identifiable components in the posterior distribution (furthermore, for the dataset presented in this paper the use of a large prior variance led to unstable and switching MCMC simulations). We rather adopt a default option introduced in Viallefont et al. (2000), by setting $\alpha > 1$ to avoid the aforementioned peak effect, and the prior mean $\alpha/\beta$ equal to the median of observed Standardized Mortality Ratios.
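The default setting just described can be sketched as follows, assuming the value $\alpha = 2$ used later in the case study; the counts and expected cases are hypothetical and the function name is illustrative.

```python
from statistics import median


def gamma_prior_for_levels(y, E, alpha=2.0):
    """Return (alpha, beta) such that the Gamma prior mean alpha/beta
    equals the median of the observed SMRs y_i / E_i."""
    smr = [yi / ei for yi, ei in zip(y, E)]
    beta = alpha / median(smr)
    return alpha, beta


# Hypothetical counts and expected cases for five areas.
y = [4, 0, 7, 2, 3]
E = [5.0, 1.0, 3.5, 2.0, 4.0]
alpha, beta = gamma_prior_for_levels(y, E)
```

Using the median rather than the mean of the SMRs keeps the prior location from being driven by the few areas with very small expected counts.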

3 Model Choice 

To compare mixture models having a different number of components, or different models altogether (for example the BYM and the proposed model), the use of the recently introduced Deviance Information Criterion (DIC, Spiegelhalter et al. (2001)) is suggested. For a generic model with possibly vector valued parameter $\xi$, the Bayesian deviance is defined by

$$D(\xi) = -2 \log p(y \mid \xi) + 2 \log f(y) \qquad (15)$$

where the likelihood of the current model $p(y \mid \xi)$ is compared to the baseline term $f(y)$ depending solely on the data $y$, and thus not affecting posterior inferences. For distributions within the exponential family, the similarity with the classical deviance statistic can be strengthened by assuming $f(y)$ to be the saturated likelihood

$$f(y) = \prod_i p(y_i \mid \mu(\xi_i) = y_i) \qquad (16)$$

where $\mu(\xi_i)$ is the expected value of $p(y_i \mid \xi_i)$ as a function of $\xi_i$. Notice that $\xi$ is usually assumed to be the canonical parameter (for example, in the BYM model $\xi = \varphi$ and $\mu(\xi_i) = E_i \exp(\varphi_i)$): this choice is not compulsory since there is usually little difference in using a non-canonical parametrization, as is done for example with the mixture likelihood (5). Simple calculations using (16) show that the (saturated) Bayesian deviance (15) turns out to be

$$D_s(\theta) = 2 \sum_i \left[ y_i \log \frac{y_i}{\theta_i E_i} - y_i + \theta_i E_i \right] \qquad (17)$$

where $\theta_i = \exp(\varphi_i)$ for the BYM model and $\theta_i = \psi_{z_i}$ for the proposed model: expression (17) can be easily implemented in the WinBUGS 1.3 syntax and monitored at each sweep of the MCMC algorithm. The Deviance Information Criterion is defined by

$$\mathrm{DIC} = \bar{D} + p_D \qquad (18)$$

where $\bar{D}$ is the posterior expectation of (17) and

$$p_D = \bar{D} - D_s(\bar{\theta}) \qquad (19)$$

with $\bar{\theta} = (\bar{\theta}_1, \ldots, \bar{\theta}_n)$, $\bar{\theta}_i = \exp[E(\varphi_i \mid y)]$ for the BYM model and $\bar{\theta}_i = E(\psi_{z_i} \mid y)$ for the proposed model; $p_D$ is the effective number of parameters (see Spiegelhalter et al. (2001) for details), which is often less than the total number of parameters, due to the interdependences among parameters at the lowest level of the hierarchy (the likelihood) introduced by random effects specified at higher levels. Spiegelhalter et al. (2001) justify the proposed criterion by noticing that $\bar{D}$ can be considered as a posterior measure of the goodness of fit, while $p_D$ is a penalty term measuring the complexity of the current model. Smaller values of the DIC indicate a better model.
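The quantities in (17), (18) and (19) can be computed from stored MCMC draws of the relative risks with a short Python sketch. The data and draws below are hypothetical, and the plug-in estimate is taken to be the posterior mean of the relative risks, which is the choice described for the proposed model (the BYM model would plug in the exponential of the posterior mean of the log-relative risks instead).

```python
import math


def saturated_deviance(y, E, theta):
    """Saturated Poisson deviance; y*log(y/mu) is taken as 0 when y = 0."""
    total = 0.0
    for yi, ei, ti in zip(y, E, theta):
        mu = ti * ei
        log_term = yi * math.log(yi / mu) if yi > 0 else 0.0
        total += log_term - yi + mu
    return 2.0 * total


def dic(y, E, theta_draws):
    """theta_draws[g][i]: relative risk of area i at MCMC iteration g.
    Returns (D_bar, p_D, DIC), plugging in the posterior mean of theta."""
    n_iter = len(theta_draws)
    d_bar = sum(saturated_deviance(y, E, th) for th in theta_draws) / n_iter
    theta_bar = [sum(col) / n_iter for col in zip(*theta_draws)]
    p_d = d_bar - saturated_deviance(y, E, theta_bar)
    return d_bar, p_d, d_bar + p_d


# Hypothetical data and draws, only to exercise the functions.
y = [3, 0, 5]
E = [2.0, 1.5, 4.0]
theta_draws = [[1.2, 0.4, 1.1], [1.6, 0.2, 1.3], [1.4, 0.3, 1.2]]
d_bar, p_d, dic_value = dic(y, E, theta_draws)
```

Because the Poisson deviance is convex in the relative risks, the penalty computed with this plug-in is non-negative; the deviance itself vanishes when the fitted means reproduce the observed counts.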



4 A Synthetic Case Study 

In this section a synthetic dataset is analyzed to assess the performance of the proposed model when the marginal distribution of the underlying risks shows some discontinuities. The chosen areal units consist of the 95 Italian departments in 1994 (notice that this subdivision has no longer been in use since 1995); the synthetic relative risk structure is shown in the first map of Fig. 1 (NORTH-SOUTH) and it consists of two well separated levels, $\psi_1 = 0.8$ corresponding to the lightest gray and $\psi_2 = 1.6$ to the darkest one. Counts $y_i$, $i = 1, \ldots, 95$, were independently simulated by the model $y_i \sim \mathrm{Poisson}(\theta_i E_i)$, taking $\theta_i = \psi_1$ or $\theta_i = \psi_2$ according to the NORTH-SOUTH map and considering as expected cases $E_i$ those obtained in a previous study concerning the mortality for endocrine and related disorders in Italy in 1994 (Bilancia and Pollice (2000)), scaled by 10 so as to obtain a moderate number of areas having few people at risk. Due to the largely varying sizes of simulated counts (see the map COUNTS in Fig. 1) and expected cases throughout the map, it is not surprising that the Standardized Mortality Ratios $\mathrm{SMR}_i = y_i/E_i$, $i = 1, \ldots, 95$, result in a quite unstable map (SMRs in Fig. 1) where the noise indeed dominates, though some evidence for a discontinuous N-S trend is still present.

The following four models were estimated using the simulated dataset:

• BYM-A, a standard Besag, York and Mollié model (Section 1) with $[\gamma_u] \sim \mathrm{Gamma}(a_u, b_u)$, $[\gamma_\varepsilon] \sim \mathrm{Gamma}(a_\varepsilon, b_\varepsilon)$, $a_u = 0.5$ and $b_u = 0.0005$ (neutral specification, see Bilancia and Pollice (2000)), $a_\varepsilon = 0.01$ and $b_\varepsilon = 0.01$;

• BYM-B, same as BYM-A except that $a_u = 0.01$ and $b_u = 0.01$ (such a hyperprior setting generally turns out to be less neutral and imposes a deeper regularization of the map);

• MIXCOV2-A, the proposed model with $k = 2$ components, $\alpha = 2$ and $\alpha/\beta$ equal to the median of observed SMRs, $a_u = 0.5$ and $b_u = 0.0005$;

• MIXCOV2-B, same as MIXCOV2-A except that $a_u = 0.01$ and $b_u = 0.01$.

To assess their performance with respect to simulated risks, esti- 
mated models were ranked by means of the global Mean Squared 
Error, i.e. 




where $\hat{\theta}_i = E[\exp(\varphi_i) \mid y]$ for BYM-A and BYM-B, whereas $\hat{\theta}_i = E[\psi_{z_i} \mid y]$ for the MIXCOV2-A and MIXCOV2-B models; from a posterior inference point of view, models were ranked on the basis of the DIC. Results are reported in Table 1.




Fig. 1. NORTH-SOUTH: the synthetic risk structure with $\psi_1 = 0.8$ corresponding to the lightest gray level and $\psi_2 = 1.6$ corresponding to the darkest one. COUNTS: simulated counts, $y_i$, $i = 1, \ldots, 95$. SMRs: Standardized Mortality Ratios, $\mathrm{SMR}_i = y_i/E_i$. BYM-B: posterior estimates of relative risks from the BYM-B model. MIXCOV2-A: posterior estimates of relative risks from the MIXCOV2-A model. CLASS: posterior classification of areas based on allocation probabilities (13), within the MIXCOV2-A model.



Notice that both the DIC and the MSE agree in choosing the mixture model rather than the BYM; the best model according to the DIC is MIXCOV2-A, and moving to the latter from BYM-B the relative decrease of the MSE is about 18%. A careful look at maps BYM-B and MIXCOV2-A in Fig. 1 reveals that the latter produces a more reliable de-noising of the observed SMRs.

Table 1. Results of the model choice criteria for estimated models. $\bar{D}$: posterior mean of the Bayesian deviance (17). $p_D$: effective number of parameters. DIC: Deviance Information Criterion (18). MSE: global Mean Squared Error (20).

Model         D_bar     p_D       DIC       MSE
BYM-A         93.39     62.05     155.44    0.1792
BYM-B         93.21     60.19     153.40    0.1785
MIXCOV2-A     86.46     29.06     115.52    0.1462
MIXCOV2-B     87.00     29.33     116.33    0.1456


The proposed model also drastically reduces the effective number of parameters $p_D$, due to the intrinsically discrete structure of the space of relative risks. The last map in Fig. 1 (CLASS) shows the classification of areas made by the MIXCOV2-A model on the basis of the posterior allocation probabilities (13) (estimated by suitable Monte Carlo averages): the agreement with the original risk structure is remarkable.

It is nevertheless worth noticing that in more realistic situations (characterized by a less discontinuous risk structure) the discrepancy between the performance of the proposed model and that of the BYM model could be appreciably reduced: further extensive simulations are needed to assess this key point.



Abstract. The concept of second- (and higher-) order interaction is widely used in categorical data analysis, where it proves useful for explaining the interdependence among three (or more) variables. Its use seems to be less common for continuous multivariate distributions, most likely owing to the predominant role of the Multivariate Normal distribution, for which any interaction involving more than two variables is necessarily zero. In this paper we explore the usefulness of a second-order interaction measure for studying the interdependence among three continuous random variables, by applying it to a trivariate Generalized Gamma distribution proposed by Bologna (2000).



1 Introduction 

The concept of interaction is ubiquitous and somewhat ambiguous 
in statistics. In general terms, it is meant to denote a joint feature 
of a set of variables, which in some sense perturbs, or modifies, 
the behaviour of those variables taken separately. This is a vague 
definition, which needs appropriate qualifications to be formally 
handled. Such qualifications can be specified according to different 
points of view: 

• direction: an interaction assumes a different interpretation when the relationships we are examining are directed (asymmetric) or undirected (symmetric). From this point of view, we can distinguish between

  - an asymmetric specification: an interaction is a joint effect of a set of explanatory variables on a set of response variables, which modifies the main effects the variables would have if they acted separately; this is the meaning of the term interaction in regression-type models (e.g. ANOVA, logistic regression, and the like);

  - a symmetric specification: an interaction is a feature of the joint distribution of a set of variables, which causes this distribution to depart from what it would be were the variables independent (in a broad sense); this is the meaning of the term interaction in correlation-type models (e.g. log-linear models for contingency tables).

• level: an interaction can be measured at different levels of characterization of the distribution under examination. From this point of view, we can distinguish between

  - a distributional interaction, when the interaction (asymmetric or symmetric) involves the whole distribution of interest;

  - a moment interaction, when the interaction (asymmetric or symmetric) involves only one or a few moments of the distribution of interest.

• mathematical form: even when the specification of what is meant by interaction is agreed upon, there is still room for arbitrariness in its mathematical definition, and therefore in how to measure it. As a consequence, additive, log-additive, multiplicative, etc. interactions have been proposed in the literature.

Probably the most complete and insightful work on the problems and subtleties of the concept of interaction is due to Darroch (see, for example, Darroch, 1983). One interesting point that arises here concerns the parallelism between continuous and categorical distributions. Unlike what happens for other ideas and methods in multivariate analysis, the concept of interaction, and particularly of second-, third- or higher-order interaction, is better established and more widely used in the analysis of multivariate categorical distributions (e.g. through the use of log-linear models) than in the analysis of continuous multivariate distributions. In the sequel, we shall argue that this is due to the predominant role among the latter of the Multivariate Normal distribution, for which all interactions of order higher than one are absent.

In this paper, we define measures of distributional second-order interaction, both symmetric and asymmetric, which can be used for both continuous and categorical variables, but focus on their use for the former. In order to appreciate the potential of analyzing interactions of order higher than one in the continuous case, it is necessary to turn to non-Normal multivariate distributions: as an example, we explore the usefulness of the proposed measures of second-order interaction by applying them to a trivariate Generalized Gamma distribution proposed by Bologna (2000).

2 Building Interaction Measures for 
Multivariate Continuous Distributions 

According to Bologna and Lovison (2001), there are basically two approaches in the literature for building up (higher order) interaction measures for multivariate continuous distributions:

1. choose a sensible interaction measure for categorical variables, and try to extend it to the multivariate continuous case (this is essentially the approach followed by Holland and Wang, 1987, in introducing their local dependence function);

2. define from scratch an interaction as a “departure from a reference model” and build up a measure with the property of being zero when the model holds (this is essentially the approach followed by Whittaker, 1990, in introducing his partial derivative measures of interaction).

2.1 Building a Second Order Symmetric Interaction 
(SOSI) Measure 

• The Holland and Wang Approach

Holland and Wang (1987) proposed a measure of symmetric interaction for bivariate distributions, extending the Cross Product Ratio used for categorical variables to the continuous case. Along this line of reasoning we choose the log Ratio of Cross Product Ratios (RCPR) as an optimal measure of second order interaction among three categorical variables. Let $X, Y, Z$ be three binary r.v.'s, each taking the two values {0, 1}, and let $f_{X,Y,Z}(x, y, z) = \Pr\{X = x, Y = y, Z = z\}$ be their joint p.d.f. The log Ratio of Cross Product Ratios, i.e. the log of the ratio between the cross product ratio of $(X, Y)$ at $Z = 1$ and that at $Z = 0$, is:

$$\log \mathrm{RCPR} = \log \frac{f_{X,Y,Z}(1,1,1)\, f_{X,Y,Z}(1,0,0)\, f_{X,Y,Z}(0,1,0)\, f_{X,Y,Z}(0,0,1)}{f_{X,Y,Z}(1,0,1)\, f_{X,Y,Z}(1,1,0)\, f_{X,Y,Z}(0,1,1)\, f_{X,Y,Z}(0,0,0)}$$

The extension to the continuous case proceeds by defining an infinitesimal parallelepiped with coordinates $(x, y, z)$ at the bottom left corner and $(x + dx, y + dy, z + dz)$ at the top right corner, calculating the continuous analog of log RCPR and finally finding the limit (for typographical reasons, on the left-hand side of the following formula we use $f(\cdot)$ instead of $f_{X,Y,Z}(\cdot)$):

$$\lim_{\substack{dx \to 0 \\ dy \to 0 \\ dz \to 0}} \frac{1}{dx\,dy\,dz} \log \frac{f(x+dx, y+dy, z+dz)\, f(x+dx, y, z)\, f(x, y+dy, z)\, f(x, y, z+dz)}{f(x+dx, y, z+dz)\, f(x+dx, y+dy, z)\, f(x, y+dy, z+dz)\, f(x, y, z)} = \frac{\partial^3 \log f_{X,Y,Z}(x, y, z)}{\partial x\, \partial y\, \partial z}$$
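The limit above can be illustrated numerically. The following Python sketch evaluates the continuous analog of log RCPR over a small cube and divides it by the cube volume; the two test functions are hypothetical log-density-like surfaces (known only up to a constant) chosen so that the third mixed derivative is known exactly.

```python
def log_rcpr(log_f, x, y, z, d):
    """Third-order mixed difference of log f over a cube of side d, i.e. the
    continuous analog of log RCPR before dividing by the volume d**3."""
    plus = (log_f(x + d, y + d, z + d) + log_f(x + d, y, z)
            + log_f(x, y + d, z) + log_f(x, y, z + d))
    minus = (log_f(x + d, y + d, z) + log_f(x + d, y, z + d)
             + log_f(x, y + d, z + d) + log_f(x, y, z))
    return plus - minus


# Hypothetical surfaces, for illustration only.
def pairwise_only(x, y, z):
    # Only terms in single variables and pairs: no second-order interaction.
    return -(x * x + y * y + z * z) + 0.5 * x * y - 0.3 * x * z + 0.2 * y * z


def with_triplet(x, y, z):
    # Adds a genuinely three-variable term with mixed derivative 0.7.
    return pairwise_only(x, y, z) + 0.7 * x * y * z


d = 1e-2
no_interaction = log_rcpr(pairwise_only, 0.4, 1.1, -0.6, d) / d**3
interaction = log_rcpr(with_triplet, 0.4, 1.1, -0.6, d) / d**3
```

Terms involving at most two of the three variables cancel in the mixed difference, so only the three-variable term contributes to the limit.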

• The Whittaker Approach 

Since in this section we are focussing on the study of symmetric relationships, the distribution of interest is the joint p.d.f. $f_{X,Y,Z}(x, y, z)$, whose natural logarithm, by the factorization theorem, can be decomposed as:

$$\log f_{X,Y,Z}(x,y,z) = \theta + \log h_X(x) + \log h_Y(y) + \log h_Z(z) + \log h_{X,Y}(x,y) + \log h_{X,Z}(x,z) + \log h_{Y,Z}(y,z) + \log h_{X,Y,Z}(x,y,z)$$

In order to build up a measure of symmetric second-order interaction by the partial derivative approach, we choose as reference model:

$$\log f_{X,Y,Z}(x,y,z) = \theta + \log h_X(x) + \log h_Y(y) + \log h_Z(z) + \log h_{X,Y}(x,y) + \log h_{X,Z}(x,z) + \log h_{Y,Z}(y,z)$$

which can be termed the no second-order interaction model, since it involves functions of the pairs of variables $(X, Y)$, $(X, Z)$, $(Y, Z)$, but not of the triplet $(X, Y, Z)$. Following Whittaker (1990):

$$\frac{\partial^3 \log f_{X,Y,Z}(x,y,z)}{\partial x\, \partial y\, \partial z} = \frac{\partial^3 \log h_{X,Y,Z}(x,y,z)}{\partial x\, \partial y\, \partial z}$$

is a measure of second order symmetric interaction, since it is zero if the reference model holds and otherwise it quantifies the departure from it.

• A Unified Measure of Second Order Symmetric Interaction

Since the measures arrived at through the two approaches turn out to be the same (a result curiously neglected in the literature), we can denote the measure by the common acronym SOSI (Second Order Symmetric Interaction):

$$\mathrm{SOSI} = \frac{\partial^3 \log f_{X,Y,Z}(x,y,z)}{\partial x\, \partial y\, \partial z} \qquad (1)$$



2.2 Building a Second Order Asymmetric Interaction 
(SOAI) Measure 



• The Holland and Wang Approach 

The asymmetric case does not seem to have been considered by Holland and Wang. However, according to their approach, it is natural to take as an optimal measure of second order asymmetric interaction for categorical variables the Difference of Differences of conditional logits (DDCL). Denoting by $\mathrm{logit}\, f_{Z|X,Y}(1 \mid x, y) = \log \frac{\Pr\{Z = 1 \mid x, y\}}{\Pr\{Z = 0 \mid x, y\}}$ the logit of $Z$ conditional on the pair of explanatory binary variables $(X, Y)$, DDCL is defined as:

$$\mathrm{DDCL} = \left(\mathrm{logit}\, f_{Z|X,Y}(1 \mid 1, 1) - \mathrm{logit}\, f_{Z|X,Y}(1 \mid 1, 0)\right) - \left(\mathrm{logit}\, f_{Z|X,Y}(1 \mid 0, 1) - \mathrm{logit}\, f_{Z|X,Y}(1 \mid 0, 0)\right)$$

The first step in the extension of DDCL to the continuous case is to define a continuous analog of the logit transformation:

$$\mathrm{logit}\, f_{Z|X,Y}(z \mid x, y) = \lim_{dz \to 0} \frac{1}{dz} \log \frac{f_{Z|X,Y}(z + dz \mid x, y)}{f_{Z|X,Y}(z \mid x, y)} = \frac{\partial \log f_{Z|X,Y}(z \mid x, y)}{\partial z}$$

Then, the extension of DDCL to the continuous case is easily found to be:

$$\lim_{\substack{dx \to 0 \\ dy \to 0}} \frac{1}{dx\,dy} \left[ \left( \frac{\partial \log f_{Z|X,Y}(z \mid x+dx, y+dy)}{\partial z} - \frac{\partial \log f_{Z|X,Y}(z \mid x+dx, y)}{\partial z} \right) - \left( \frac{\partial \log f_{Z|X,Y}(z \mid x, y+dy)}{\partial z} - \frac{\partial \log f_{Z|X,Y}(z \mid x, y)}{\partial z} \right) \right] = \frac{\partial^2}{\partial x\, \partial y} \left[ \frac{\partial \log f_{Z|X,Y}(z \mid x, y)}{\partial z} \right]$$

• The Whittaker Approach 

Since we are now focussing on the study of asymmetric relationships, the distribution of interest is the conditional p.d.f. $f_{Z|X,Y}(z \mid x, y)$, whose natural logarithm, by the factorization theorem, can be decomposed as:

$$\log f_{Z|X,Y}(z \mid x, y) = \log h_Z(z) + \log h_{X,Z}(x, z) + \log h_{Y,Z}(y, z) + \log h_{X,Y,Z}(x, y, z)$$

The reference model is here that of absence of an interactive effect of $(X, Y)$ on $Z$:

$$\log f_{Z|X,Y}(z \mid x, y) = \log h_Z(z) + \log h_{X,Z}(x, z) + \log h_{Y,Z}(y, z)$$

which postulates separate effects of $X$ and $Y$ on $Z$, but no joint effects of the pair $(X, Y)$ on $Z$. Then:

$$\frac{\partial^2}{\partial x\, \partial y} \left[ \frac{\partial \log f_{Z|X,Y}(z \mid x, y)}{\partial z} \right]$$

is a measure of second order asymmetric interaction, since it is zero if the reference model holds and otherwise it quantifies the departure from it, i.e. the intensity of the interactive effect of $(X, Y)$ on $Z$.

• A Unified Measure of Second Order Asymmetric Interaction

Again, the same measure is obtained along either approach, and we can denote it by the common acronym SOAI (Second Order Asymmetric Interaction):

$$\mathrm{SOAI} = \frac{\partial^2}{\partial x\, \partial y} \left[ \frac{\partial \log f_{Z|X,Y}(z \mid x, y)}{\partial z} \right] \qquad (2)$$



Finally, notice that, although from a purely mathematical point 
of view the SOSI and SOAI measures coincide, it is useful to keep 
them distinct to stress their different (symmetric vs. asymmetric) 
interpretation. 



3 The Normal Case 



In this section, we explore the behaviour of the SOSI and SOAI measures in the case of a trivariate centered and scaled Normal random vector:

$$[X, Y, Z]' \sim N_3(\mu, \Sigma), \quad \mu = 0, \ \Sigma = P$$

To derive the measure of symmetric interaction, it is convenient to write the trivariate Normal log-density as:

$$\log f_{X,Y,Z}(x,y,z) = \theta + \theta^{X} x^2 + \theta^{Y} y^2 + \theta^{Z} z^2 + \theta^{XY} xy + \theta^{XZ} xz + \theta^{YZ} yz$$

where:

$$\theta = -\frac{1}{2} \log(|P|) - \frac{3}{2} \log(2\pi)$$

$$\theta^{X} = -\frac{1}{2} \frac{(1 - \rho_{yz}^2)}{|P|}, \quad \theta^{Y} = -\frac{1}{2} \frac{(1 - \rho_{xz}^2)}{|P|}, \quad \theta^{Z} = -\frac{1}{2} \frac{(1 - \rho_{xy}^2)}{|P|}$$

$$\theta^{XY} = \frac{(\rho_{xy} - \rho_{xz}\rho_{yz})}{|P|}, \quad \theta^{XZ} = \frac{(\rho_{xz} - \rho_{xy}\rho_{yz})}{|P|}, \quad \theta^{YZ} = \frac{(\rho_{yz} - \rho_{xy}\rho_{xz})}{|P|}$$

Clearly:

$$\mathrm{SOSI} = \frac{\partial^3 \log f_{X,Y,Z}(x,y,z)}{\partial x\, \partial y\, \partial z} = 0$$

In order to study the asymmetric interaction, suppose $(X, Y)$ are explanatory variables, and $Z$ is the response variable. Then, the logit of the distribution of $Z$ conditional on the pair $(X, Y)$ is:

$$\mathrm{logit}\, f_{Z|X,Y}(z \mid x, y) = 2\theta^{Z} z + \theta^{XZ} x + \theta^{YZ} y$$

where:

$$\theta^{Z} = -\frac{1}{2} \frac{(1 - \rho_{xy}^2)}{|P|}, \quad \theta^{XZ} = \frac{(\rho_{xz} - \rho_{xy}\rho_{yz})}{|P|}, \quad \theta^{YZ} = \frac{(\rho_{yz} - \rho_{xy}\rho_{xz})}{|P|}$$

whence, obviously, we obtain:

$$\mathrm{SOAI} = \frac{\partial^2}{\partial x\, \partial y} \left[ \frac{\partial \log f_{Z|X,Y}(z \mid x, y)}{\partial z} \right] = 0$$

So, whether we look at it from the symmetric or the asymmetric point of view, the trivariate Normal distribution is characterized by the absence of second order interactions; moreover, it is easy to show that this conclusion extends to the multivariate Normal distribution of any dimension, which has all interactions of order higher than one equal to zero. This is hardly a surprise, since it confirms from a different perspective the well known result that the multivariate Normal distribution is completely characterized by its first two (multivariate) moments, but it is useful to recall it here because, keeping in mind the predominant role played historically by the Normal among the continuous multivariate distributions, it may explain the scarce interest in the study of second- and higher-order interactions in the continuous case.
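The Normal-case results can be checked numerically with a short Python sketch. The correlations are hypothetical, the constant term of the log-density is omitted because it does not affect any derivative, and the derivatives are approximated by finite differences.

```python
def normal_log_density_coeffs(r_xy, r_xz, r_yz):
    """Coefficients of the trivariate standard Normal log-density
    (constant term omitted) as functions of the three correlations."""
    det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
    return {
        "X": -0.5 * (1 - r_yz**2) / det,
        "Y": -0.5 * (1 - r_xz**2) / det,
        "Z": -0.5 * (1 - r_xy**2) / det,
        "XY": (r_xy - r_xz * r_yz) / det,
        "XZ": (r_xz - r_xy * r_yz) / det,
        "YZ": (r_yz - r_xy * r_xz) / det,
    }


def log_f(c, x, y, z):
    return (c["X"] * x * x + c["Y"] * y * y + c["Z"] * z * z
            + c["XY"] * x * y + c["XZ"] * x * z + c["YZ"] * y * z)


def mixed_third_difference(g, x, y, z, d):
    """Finite-difference approximation of the third mixed derivative."""
    return (g(x + d, y + d, z + d) - g(x + d, y + d, z) - g(x + d, y, z + d)
            - g(x, y + d, z + d) + g(x + d, y, z) + g(x, y + d, z)
            + g(x, y, z + d) - g(x, y, z)) / d**3


c = normal_log_density_coeffs(0.5, -0.2, 0.3)  # hypothetical correlations
sosi = mixed_third_difference(lambda x, y, z: log_f(c, x, y, z),
                              0.7, -1.2, 0.4, 1e-2)

# Conditional logit of Z: derivative of the log-density with respect to z.
logit_z = 2 * c["Z"] * 0.4 + c["XZ"] * 0.7 + c["YZ"] * (-1.2)
h = 1e-6
numeric_logit = (log_f(c, 0.7, -1.2, 0.4 + h)
                 - log_f(c, 0.7, -1.2, 0.4 - h)) / (2 * h)
```

Since the conditional logit is linear in $x$ and $y$, its mixed derivative with respect to the two explanatory variables is zero as well, which is the SOAI result.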



4 A Non-normal Case: a Trivariate 
Generalized Gamma Distribution 



Bologna (2000) proposed a trivariate Generalized Gamma distribution, generalizing his bivariate Gamma distribution (Bologna, 1987), by deriving the distribution of the random vector obtained by the transformation:

$$X = (T^2)^{1/r}, \quad Y = (U^2)^{1/r}, \quad Z = (V^2)^{1/r}; \quad r > 0$$

when $[T, U, V]' \sim N_3(\mu, \Sigma)$.

For the sake of simplicity we consider in this paper only the special case in which:

$$\mu = 0, \quad \sigma_{jj} = 1 \ \forall j, \quad \Sigma = P = (1 - \rho) I + \rho J, \ \text{with } J = 11'$$

i.e. the case of three centered, scaled and equicorrelated Normal r.v.'s. In this simplified case, it can be shown that:

$$f_{X,Y,Z}(x,y,z) = \frac{r^3\, x^{r/2-1} y^{r/2-1} z^{r/2-1}}{8\sqrt{2}\, \pi^{3/2} \sqrt{|P|}} \exp\left\{ \frac{1}{2} \left[ -a (x^r + y^r + z^r) - 2\tau \left( x^{r/2} y^{r/2} + x^{r/2} z^{r/2} + y^{r/2} z^{r/2} \right) \right] \right\} \left( 1 + e^{2 x^{r/2} \tau (y^{r/2} + z^{r/2})} + e^{2 y^{r/2} \tau (x^{r/2} + z^{r/2})} + e^{2 z^{r/2} \tau (x^{r/2} + y^{r/2})} \right)$$

where: $|P| = (1 - 3\rho^2 + 2\rho^3)$, $a = \frac{1 - \rho^2}{|P|}$, $\tau = \frac{\rho^2 - \rho}{|P|}$.

Suppose we are interested in the study of symmetric interaction. Then, it is useful to write the log-density as:

$$\log f_{X,Y,Z}(x,y,z) = \theta + \log h_X(x) + \log h_Y(y) + \log h_Z(z) + \log h_{X,Y}(x,y) + \log h_{X,Z}(x,z) + \log h_{Y,Z}(y,z) + \log h_{X,Y,Z}(x,y,z)$$

where:

$$\theta = \log \frac{r^3}{8\sqrt{2}\, \pi^{3/2} \sqrt{|P|}}$$

$$\log h_X(x) = \left(\frac{r}{2} - 1\right) \log x - \frac{a x^r}{2}, \quad \log h_Y(y) = \left(\frac{r}{2} - 1\right) \log y - \frac{a y^r}{2}, \quad \log h_Z(z) = \left(\frac{r}{2} - 1\right) \log z - \frac{a z^r}{2}$$

$$\log h_{X,Y}(x,y) = -\tau x^{r/2} y^{r/2}, \quad \log h_{X,Z}(x,z) = -\tau x^{r/2} z^{r/2}, \quad \log h_{Y,Z}(y,z) = -\tau y^{r/2} z^{r/2}$$

$$\log h_{X,Y,Z}(x,y,z) = \log\left( 1 + e^{2 x^{r/2} \tau (y^{r/2} + z^{r/2})} + e^{2 y^{r/2} \tau (x^{r/2} + z^{r/2})} + e^{2 z^{r/2} \tau (x^{r/2} + y^{r/2})} \right)$$

We get:

$$\mathrm{SOSI} = \frac{\partial^3 \log h_{X,Y,Z}(x,y,z)}{\partial x\, \partial y\, \partial z} = \psi(x, y, z, r, \tau)$$

where $\psi(x, y, z, r, \tau)$ is a quite complicated function of its arguments. The precise form of such a function can be found in Bologna and Lovison (2001); what is crucial here is that, unlike in the Normal case, the SOSI measure depends on $r$, on all of the three variables $x, y, z$ (i.e. it is a local measure of interaction), and on $\rho$ only through the reparameterization $\tau = (\rho^2 - \rho)/|P|$.

Figure 1 shows SOSI as a function of (x, y), for p — 0.8 (which 
corresponds to r = —1.53846) and nine combinations of the fol- 
lowing values of r and z: r = 2,4,6, z = 0.3, 1,3. Inspection of 
Figure 1 suggests some interesting conclusions: 

• the three variables interact, most of the time positively (for
smaller values of X, Y, Z) and less often negatively (for
intermediate values of X, Y, Z);

• as X, Y, Z get large, SOSI approaches zero, i.e. they cease to 
interact; 

• all previous patterns are accentuated as r increases. 

Some insight into the previous result can be gained by considering
SOSI as a measure of the variation in the first-order interaction
between X and Y (conditional on Z) induced by variations in Z:
where SOSI is positive (negative), increments in Z cause an
increase (decrease) in the interaction between X and Y; where
SOSI is equal to, or approaches, zero, the interaction between X
and Y is, or tends to be, constant with respect to Z.

Fig. 1. The SOSI measure for various values of r and z, with ρ = 0.8.

This interpretation may be further helped by a visual examination
of the behaviour of the conditional bivariate distributions
of the pair (X, Y) given Z, with respect to variations in Z. To
support such examination, Figure 2 visualizes the contour levels
of the conditional distributions of the pair (X, Y) given four
specified values of Z (z = 0.3, 1, 1.3, 2), with ρ = 0.8 and r = 4.
Fig. 2. Contour levels of the conditional distributions of (X, Y) | Z, with ρ = 0.8
and r = 4, corresponding to four specified values of Z (from top left to bottom
right: z = 0.3, 1, 1.3, 2)

Inspection of Figure 2 provides a new interpretation of the 
pattern of signs in Figure 1. In fact, notice how the shape, and not 
only the location, of these four bivariate distributions changes as 
a response to changes in the values of Z, moving from a situation 
of positive bivariate asymmetry (for small values of Z, i.e. much 
less than 1), through a situation of negative bivariate asymmetry 
(for intermediate values of Z, i.e. values around 1), finally to a
situation of approximate bivariate symmetry (for large values of 
Z, i.e. for values much greater than 1), which is characterized by 
almost elliptical contour levels. 
Abstract. In this paper, we study the relationship between daily
non-accidental deaths and air pollution in the city of Philadelphia in the years
1974-1988. To model the data, we propose the use of dynamic generalized linear
models. These models make it possible to handle both serial dependence and
time-varying effects of the covariates. Inference is performed by using the
extended Kalman filter and smoother.



1 Introduction and Motivation 

Various applied fields, like environmental statistics or environ- 
mental epidemiology, deal with time series data in the form of 
discrete or non-normal outcomes. In environmental epidemiology, 
a key problem is the role of serial correlation in the modelling 
framework. Serial correlation in the response arises from several
sources. First of all, outcomes depend on serially correlated
explanatory variables, so that the time series structure of the
covariates imparts a highly structured pattern of interdependence
on the response. Second, the effect of the explanatory variables on
the outcomes usually lasts some time; for example, the effect of a
high pollution event on population health spreads over some days,
although the mechanism of the effect is unfortunately unknown. This,
again, depends partly on the serial correlation of the pollutants and
partly on the natural response mechanism of the human body to
exposure to toxic agents.

A natural way to deal with such issues would be to develop
association models in which the dependence structure within the
explanatory variables and between covariates and response is
correctly accounted for. However, in most cases neither the
dependence mechanism among the explanatory variables nor the
physical association between response and covariates is adequately
known, so that an explicit probabilistic model of association is
rarely available.

In the environmental epidemiology literature, the usual modelling
strategy (see Brumback et al. (2000), for up to date references)
is based on estimating a suitable Generalized Linear Model
(GLM) or Generalized Additive Model (GAM) under the assumption
of independent outcomes. The temporal behavior is controlled
by means of a cyclic function inserted in the regressor term.
In the diagnostic phase, checks for the presence of serial correlation
in the residuals are performed. In general, serial correlation in
the residuals arises when the model for the temporal component
is inadequate to pick up all the fluctuations of the underlying
outcome behavior. In these cases, a more accurate modeling of
the temporal component can solve the problem. If this is not the
case, the modeling strategy can be extended to account for serially
correlated terms.

In the literature, two main approaches can be followed to add 
autocorrelation to the standard GLM or GAM setting: either a 
latent autocorrelated time series error is assumed for the model 
(generalized linear/additive model with time series error), which 
means that correlation between two subsequent outcomes is a 
known function of the marginal means of the outcomes and perhaps
of some additional parameters, or correlation is inserted into
the model by making the current outcome explicitly depend on
past outcomes (transitional generalized linear/additive model).

These extensions raise various difficulties, such as computational
burden and model checking. In this paper we propose
to employ a different modelling strategy based on Dynamic
Generalized Linear Models (DGLMs) (Fahrmeir and Tutz (1994)),
which extend the first of the approaches mentioned above. We apply
this modelling framework to the analysis of daily death counts
in Philadelphia (Kim et al. (1999)). The counts are modelled by
a Poisson distribution whose mean is driven by a latent Markov
process. In this setting, serial dependence is added to the model
structure by making use of random coefficients supplemented by
prior distributions adequate to take autocorrelation into account,
such as random walks. This implies that time-varying
covariates are allowed to enter the model not only via the
conditional mean but also through the latent process. In our view, this
framework allows a neater exploration of the dependence structure
of the data. Estimation is performed by the extended Kalman
filter and smoother.

The outline of this paper is as follows. In Section 2 we in- 
troduce some features of our modelling approach, sketching also 
the inference procedure. Section 3 briefly describes the Philadel- 
phia data set. Finally, Section 4 contains the results and some 
concluding remarks. 

2 Dynamic Generalized Linear Models 

Consider a series of counts $\{Y_t\}$ recorded at equally spaced times
$t = 1, \dots, T$ along with a vector of covariates $x_t \in \mathbb{R}^p$. Assume
that it is reasonable to divide the vector of covariates into two
components $x_t = (x'_{1t}, x'_{2t})'$, where the first component, $x_{1t}$,
includes covariates which we expect to contribute to the ‘nucleus’
of the underlying mean tendency of the counts and the second
component, $x_{2t}$, includes perturbing factors, whose influence can
be thought of as being constant over time. If the central tendency
of the counts may be thought of as resulting from the influence
of both types of covariates, then it is reasonable to model the
effect of $x_{1t}$ by a univariate latent process $\phi_t = \phi_t(x_{1t})$ and to
fix the effects of $x_{2t}$. Therefore, we assume that the conditional
distribution of $Y_t$ given $\phi_t$ follows a Poisson distribution



$$Y_t \mid \phi_t \sim P(\exp\{\phi_t + x'_{2t}\gamma\})$$



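with $\gamma$ representing the fixed regression coefficients for $x_{2t}$.
For the latent process $\phi_t$ we consider the following specification:

$$\phi_t = \omega_t + x'_{1t}\beta_t$$

$$\omega_t = 2\omega_{t-1} - \omega_{t-2} + \delta_t \quad \text{with } \delta_t \sim N(\mu_\delta, \sigma^2_\delta),$$

$$\beta_t = \beta_{t-1} + \xi_t \quad \text{with } \xi_t \sim \dots$$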
In this formulation $\omega_t$ is a discrete-time analog of a
continuous-time cubic spline, so that the model appears as a semi-parametric
model. The coefficients of the long-term explanatory variables also
move dynamically in time. This is a crucial aspect of the model.
Firstly, it makes it possible to capture long term changes in the effects of
the covariates. Secondly, the dynamics of the coefficients capture
the “carry-over” effect of the covariates, which usually
causes the effect at time $t$ to be influenced by covariates at previous
times. The previous setting falls into the general framework of the
Dynamic Generalized Linear Model (DGLM) (Fahrmeir and Tutz
(1994)). The latent process $\phi_t$ and the parameters $\gamma$ can be cast
in the transition model



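$$\alpha_t = F\alpha_{t-1} + \varepsilon_t, \qquad (1)$$

with $\varepsilon_t \sim N(0, Q)$, $\alpha_0 \sim N(a_0, P_0)$ and $a_0$, $P_0$, $Q$, $F$
hyperparameters. The observation model is given by

$$Y_t \mid \alpha_t \sim P(\exp\{z'_t\alpha_t\}). \qquad (2)$$

To exemplify, assume for convenience that $x_t = (x_{1t}, x_{2t})'$.
Then,

$$\begin{pmatrix} \omega_t \\ \omega_{t-1} \\ \beta_t \\ \gamma \end{pmatrix} =
\begin{pmatrix} 2 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} \omega_{t-1} \\ \omega_{t-2} \\ \beta_{t-1} \\ \gamma \end{pmatrix} + \varepsilon_t$$

where $\varepsilon_t \sim N(0, \mathrm{diag}(\sigma^2_\delta, 0, \sigma^2_\xi, 0))$ and $z_t = (1, 0, x_{1t}, x_{2t})'$.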
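
To make the state-space form concrete, the following is a minimal simulation sketch. It assumes the state ordering $(\omega_t, \omega_{t-1}, \beta_t, \gamma)'$ with scalar covariates $x_{1t}$ and $x_{2t}$ and zero-mean Gaussian innovations; the function names, the parameter values and the simple Poisson sampler are illustrative choices, not part of the paper.

```python
import math
import random

# Transition matrix for the state (omega_t, omega_{t-1}, beta_t, gamma)'.
F = [
    [2.0, -1.0, 0.0, 0.0],  # omega_t = 2 omega_{t-1} - omega_{t-2} + delta_t
    [1.0,  0.0, 0.0, 0.0],  # carries omega_{t-1} forward
    [0.0,  0.0, 1.0, 0.0],  # beta_t = beta_{t-1} + xi_t
    [0.0,  0.0, 0.0, 1.0],  # gamma stays constant
]

def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def poisson(mean, rng):
    """Knuth's multiplication sampler; adequate for moderate means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate(x1, x2, alpha0, sd_delta, sd_xi, seed=0):
    """Simulate states and Poisson counts from the transition and observation models."""
    rng = random.Random(seed)
    alpha, states, counts = list(alpha0), [], []
    for x1t, x2t in zip(x1, x2):
        # Innovations act only on omega_t and beta_t.
        noise = [rng.gauss(0.0, sd_delta), 0.0, rng.gauss(0.0, sd_xi), 0.0]
        alpha = [m + e for m, e in zip(mat_vec(F, alpha), noise)]
        z = [1.0, 0.0, x1t, x2t]
        eta = sum(zi * ai for zi, ai in zip(z, alpha))  # linear predictor z_t' alpha_t
        states.append(alpha)
        counts.append(poisson(math.exp(eta), rng))
    return states, counts
```

For example, `simulate([0.5] * 50, [1.0] * 50, [3.0, 3.0, 0.1, 0.05], 0.001, 0.01)` returns 50 state vectors and 50 counts; the last state component stays equal to the fixed coefficient throughout.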



2.1 Moment Structure 

We will explore the nature of the serial correlation implied on the
counts by the dynamical structure. Consider first the mean, $a_t$,
the variance, $P_t$, and the autocovariance function, $P_{t,t+h}$, of $\alpha_t$.
By recursion, it is possible to show that:

$$a_t = F^t a_0,$$

with $F^t = F \cdots F$, $t$ times, and $F^0 = I$. Analogously:

$$P_t = \sum_{k=0}^{t-1} F^k Q F'^k + F^t P_0 F'^t$$

and

$$\mathrm{Cov}(\alpha_{t+h}, \alpha_t) = F^h P_t.$$

Therefore, $\alpha_t \sim N(a_t, P_t)$, which yields for the latent process:

$$z'_t\alpha_t \sim N(z'_t a_t,\; z'_t P_t z_t)$$

and

$$\mathrm{Cov}(z'_{t+h}\alpha_{t+h},\; z'_t\alpha_t) = z'_{t+h} F^h P_t z_t.$$

We now turn to the moment structure of the observed counts,
for which we have:

$$\mu_t = E(E(Y_t \mid \alpha_t)) = E(\exp(z'_t\alpha_t)),$$

$$\sigma^2_t = E(\mathrm{Var}(Y_t \mid \alpha_t)) + \mathrm{Var}(E(Y_t \mid \alpha_t)) = E(\exp(z'_t\alpha_t)) + \mathrm{Var}(\exp(z'_t\alpha_t)).$$

This yields

$$\mu_t = \exp(z'_t a_t + z'_t P_t z_t / 2)$$

$$\mathrm{Var}(\exp(z'_t\alpha_t)) = \mu_t^2\,(\exp(z'_t P_t z_t) - 1)$$

$$\mathrm{Cov}(Y_{t+h}, Y_t) = \mu_{t+h}\,\mu_t\,(\exp(z'_{t+h} F^h P_t z_t) - 1),$$

from which we derive that:

$$\mathrm{Corr}(Y_{t+h}, Y_t) = \frac{\exp(z'_{t+h} F^h P_t z_t) - 1}{\sqrt{(\exp(z'_{t+h} P_{t+h} z_{t+h}) - 1)(\exp(z'_t P_t z_t) - 1)}}.$$


The previous results show that this model setting can accommodate
processes that are non-stationary in their second-order moments.
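
The moment formulas can be evaluated with a few lines of code. The sketch below is illustrative only: the helper names are hypothetical, and it uses the one-step recursions $a_t = F a_{t-1}$ and $P_t = F P_{t-1} F' + Q$, which are equivalent to the sums given in the text. Note that the correlation it returns is computed with the full count variances $\sigma^2_t = \mu_t + \mu_t^2(\exp(z'_t P_t z_t) - 1)$, so the Poisson term $\mu_t$ is included in the denominator; this is a choice made for the example.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def state_moments(F, Q, a0, P0, t):
    """Mean a_t and variance P_t of the state after t transitions."""
    a, P, Ft = [[v] for v in a0], P0, transpose(F)
    for _ in range(t):
        a = matmul(F, a)
        FPF = matmul(matmul(F, P), Ft)
        P = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(FPF, Q)]
    return [row[0] for row in a], P

def bilinear(z1, M, z2):
    """Bilinear form z1' M z2."""
    return sum(z1[i] * M[i][j] * z2[j]
               for i in range(len(z1)) for j in range(len(z2)))

def count_moments(F, Q, a0, P0, t, h, z_t, z_th):
    """Mean and variance of Y_t and correlation between Y_{t+h} and Y_t."""
    a_t, P_t = state_moments(F, Q, a0, P0, t)
    a_th, P_th = state_moments(F, Q, a0, P0, t + h)
    v_t, v_th = bilinear(z_t, P_t, z_t), bilinear(z_th, P_th, z_th)
    mu_t = math.exp(sum(z * a for z, a in zip(z_t, a_t)) + v_t / 2)
    mu_th = math.exp(sum(z * a for z, a in zip(z_th, a_th)) + v_th / 2)
    var_t = mu_t + mu_t ** 2 * (math.exp(v_t) - 1)
    var_th = mu_th + mu_th ** 2 * (math.exp(v_th) - 1)
    FhP = P_t
    for _ in range(h):          # builds F^h P_t
        FhP = matmul(F, FhP)
    cov = mu_th * mu_t * (math.exp(bilinear(z_th, FhP, z_t)) - 1)
    return mu_t, var_t, cov / math.sqrt(var_t * var_th)
```

In the scalar random-walk case ($F = 1$) the state variance grows linearly, $P_t = P_0 + tQ$, which makes the non-stationarity of the second-order moments explicit.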
2.2 Inference

DGLMs have two unknown quantities: the state vector $\alpha_t$ and
the hyperparameters. We collect the hyperparameters in the
vector $\lambda$, which we assume for the moment to be fixed and known,
and deal first with $\alpha_t$. Let $\alpha^*_t = (\alpha'_0, \dots, \alpha'_t)'$ and
$Y^*_t = (Y_1, \dots, Y_t)'$. The conditional distribution of $\alpha^*_T$ given the
observations,

$$p(\alpha^*_T \mid Y^*_T) \propto \prod_{t=1}^{T} p(Y_t \mid \alpha_t) \prod_{t=1}^{T} p(\alpha_t \mid \alpha_{t-1})\, p(\alpha_0), \qquad (3)$$

is non-normal. Note that in this case the conditional means and
the conditional modes are not equivalent. Due to the complicated
form of the conditional distributions involved, inference requires
some approximation. Simulation based estimation, in particular
MCMC methods, has been proposed for dealing with this problem
(see Shephard and Pitt (1997), among others). In this paper we
have chosen the approach proposed by Fahrmeir and Tutz (1994)
and we consider (3) as a penalized likelihood, avoiding a Bayesian
interpretation. More precisely, taking the logarithm of the
conditional density yields the criterion

$$PL(\alpha^*_T) = \mathrm{const} + \sum_{t=1}^{T} l_t(\alpha_t) - \frac{1}{2}(\alpha_0 - a_0)' P_0^{-1} (\alpha_0 - a_0) - \frac{1}{2}\sum_{t=1}^{T} (\alpha_t - F\alpha_{t-1})' Q^{-1} (\alpha_t - F\alpha_{t-1}), \qquad (4)$$

with $l_t(\alpha_t) = \log p(Y_t \mid \alpha_t)$. The function (4) is a penalized
log-likelihood criterion, so the conditional modes
$\hat{\alpha}^*_T = \arg\max_{\alpha^*_T} p(\alpha^*_T \mid Y^*_T)$ are maximizers of $PL(\alpha^*_T)$.
Algorithmic solutions can be efficiently obtained by iterative Kalman
filtering and smoothing (Fahrmeir and Wagenpfeil (1997)). The
hyperparameters $\lambda$ can be interpreted as smoothing parameters and
can be selected as minimizers of a generalized cross-validation criterion.
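
As an illustration of the penalized likelihood criterion, the sketch below maximizes it for the simplest possible case: a scalar random-walk state ($F = 1$, $Q = q$) with Poisson observations. It is an assumption-laden toy, not the iterative Kalman filter and smoother used in the paper: the maximizer is found by cyclic coordinate-wise Newton steps, and all names and numerical settings are hypothetical.

```python
import math

def penalized_loglik(alpha, y, a0, p0, q):
    """PL for a scalar random-walk state. alpha[0] is the initial state and
    alpha[t] pairs with y[t - 1]; the constant -log(y!) terms are dropped."""
    fit = sum(yt * at - math.exp(at) for yt, at in zip(y, alpha[1:]))
    prior = (alpha[0] - a0) ** 2 / (2.0 * p0)
    rough = sum((alpha[t] - alpha[t - 1]) ** 2
                for t in range(1, len(alpha))) / (2.0 * q)
    return fit - prior - rough

def coord_derivs(alpha, t, y, a0, p0, q):
    """First and second derivative of PL with respect to alpha[t]."""
    d, h = 0.0, 0.0
    if t == 0:
        d -= (alpha[0] - a0) / p0
        h -= 1.0 / p0
    else:
        lam = math.exp(alpha[t])
        d += y[t - 1] - lam - (alpha[t] - alpha[t - 1]) / q
        h -= lam + 1.0 / q
    if t + 1 < len(alpha):
        d += (alpha[t + 1] - alpha[t]) / q
        h -= 1.0 / q
    return d, h

def posterior_mode(y, a0, p0, q, sweeps=300, inner=25):
    """Maximize PL by cyclic coordinate-wise Newton iterations."""
    alpha = [a0] + [math.log(yt + 0.5) for yt in y]
    for _ in range(sweeps):
        for t in range(len(alpha)):
            for _ in range(inner):
                d, h = coord_derivs(alpha, t, y, a0, p0, q)
                if abs(d) < 1e-13:
                    break
                alpha[t] -= d / h   # h < 0, so this moves uphill
    return alpha
```

A smaller `q` penalizes changes in the state more heavily and so produces a smoother fitted trajectory, which is the sense in which the hyperparameters act as smoothing parameters.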

Diagnostics deserve some care: in particular, Pearson-type residuals
lose their usual properties. For more appropriate inference we
use the P-scores as suggested by Frühwirth-Schnatter (1996), and
we apply standard diagnostic tools for generalized linear models
as well as for dynamic linear models to detect model inadequacies,
if any.

Table 1. Summary statistics for Philadelphia: mortality (age 65+), temperature and
dewpoint (°F), TSP (μg/m³).

Variable   Mean    SD     25%    50%    75%
Mort       31.5    6.4    26.0   31.0   38.0
Temp       54.3   17.8    40.0   55.3   70.3
Dew        42.3   19.1    27.8   43.5   58.8
TSP        67.3   26.9    47.5   63.0   72.0

3 Data 

The first substantial analysis of a U.S. epidemiological time series
data set was by Schwartz and Dockery (1992), using data from
Philadelphia, and it was quickly followed by a number of studies
of other cities. The main Philadelphia data set used by the
researchers consisted of 14 years of daily death data (1974-1988)
with associated measurements of temperature and dewpoint, i.e.
the two meteorological variables which are believed to be the most
important confounders, and five pollutants: total suspended
particulate (TSP), sulphur dioxide (SO2), nitrogen dioxide (NO2),
carbon monoxide (CO) and ozone (O3).

In the present analysis, we have used the same data set as in 
Kim et al. (1999), and we have focused on the analysis of TSP 
effects on deaths in the population aged 65 and over, since this 
is the group most at risk. Table 1 includes summary statistics for 
the daily air pollutant concentration, key meteorology and death 
counts recorded during the study period. 

Figure 1, which plots weekly death counts along with a
nonparametric estimate of the temporal behaviour, shows that there
is an irregular seasonal effect, which cannot be explained solely
through the dependence of deaths on either meteorology or pollution
and needs to be properly modelled. Moreover, extensive
preliminary analyses show that deaths decrease with both
temperature and dewpoint until a threshold value is reached, around
75 °F for temperature and 60 °F for dewpoint, after which deaths
start to increase with increasing temperature or dewpoint. Similar
analyses for TSP show a general increase in deaths with pollutant,
although it is questionable whether there is any real effect below
about 100 μg/m³. To develop the models, we take advantage of
the exploratory analysis and of previous studies (see Kim et al.
(1999)) to construct and select sound covariates, although we do
not aim at building models which are strictly comparable with
the models already published.

Fig. 1. Weekly death counts for the years of study (1974-1988) along with a
nonparametric estimate of the temporal behaviour (solid line). Dotted vertical lines
mark the beginning of each calendar year.

4 Results and Conclusions 

Our model building strategy started from the simplest models,
i.e. models including the pollutant and the most relevant
meteorological variables. Variables that turned out not to be
significant, i.e. whose time-varying confidence bands always included
zero, were removed from the models.

Fig. 2. Time series of the observed counts along with fitted values.

As we assumed that the counts reflected an underlying tendency
of severe air pollution events, combined with adverse
meteorological conditions, to cause non-accidental death, we
included in the $x_{1t}$ component vector those covariates which
measured exposure to pollution and meteorological conditions.

Long term trends and seasonal fluctuations were controlled
by a discrete version of a spline component modelled
by a second order random walk. Weather was controlled for
by including a time varying coefficient for the mean temperature
of the previous 3 days when below 80 °F and a fixed effect
for temperature above the same cutoff. Moreover, a time varying
effect for the mean dewpoint temperature of the previous 3 days
was also introduced.
Fig. 3. Estimated trajectories for the coefficients of temperature below 80 °F and
of the dewpoint temperature with point-wise 95% confidence intervals.



All analyses were carried out using R functions and C routines 
written by ourselves. 

Figure 2 shows the time series of the observed counts along 
with the fitted values. 

Results showed a significant and positive effect of TSP on
mortality. The pollutant’s effect turned out to be fixed, i.e. the
estimated coefficient variance was close to zero, meaning that the
three day lagged measure was sufficient to capture the carry-over
effect of the pollutant.

Figure 3 shows the estimated trajectories for the coefficients
of the temperature below 80 °F and of the dewpoint temperature
and highlights the time varying effects of the variables. The
temperature effect appears to be significant over the study period,
whereas the interpretation of the dewpoint temperature effect is
less clear-cut. It appears that most of the dynamic behaviour of
the time series is captured by dewpoint temperature, so that it
becomes difficult to assess the significance of the variable and the
strength of its effect.

The aim of our work was to explore the extent to which the
DGLM setting can improve modelling procedures for
epidemiological time series studies, where a complex dependence
structure among variables is observed. Results seem satisfactory in
terms of flexibility of the modelling approach and of parsimony,
where parsimony is to be understood with respect to the number of
explanatory and confounding variables that need to be included
in the model and not with respect to the number of parameters
to be estimated.
Abstract. In environmental surveys, because of environmental fluctuations, a
population may be spread over a wide study area so that the population density is low
while many units are concentrated inside small regions. In this case, we say that
the biological community is clustered. In this setting, we propose multivariate
adaptive sampling to estimate biological diversity. In particular, in order to describe
the diversity of a biological population, we consider several distribution models for
the abundance vector. By means of a suitable simulation study, the limits and
advantages of using adaptive sampling designs are highlighted for each abundance model
considered.



1 Introduction 

Statistical Ecology encompasses numerous quantitative
methodologies that deal with the exploration of patterns in biological
communities. In some cases the aim of the study is to test
hypotheses about the underlying structure of the ecological
community. Species abundance patterns relate to the organization of
a community, including how coexisting species utilize common
resources such as food or space. Ecologists have developed several
hypotheses in an attempt to explain these interactions among
species abundances. To this end, the concept of diversity has been used
as a method for characterizing the structure of species abundance
in a community. Theoretical distributions like the geometric
distribution, the broken-stick distribution or the lognormal
distribution have been found and related to important aspects of diversity
such as Dominance or Equitability.

In many environmental and natural phenomena, the population
investigated is a collection of objects within a particular study
area. Recently, Barabesi and Fattorini (1998) and Di Battista
(2002) have discussed several sampling techniques to estimate the
species abundance of biological populations. In this paper we
propose a sampling strategy based on Multivariate Adaptive Cluster
Sampling (MACS) (Thompson (1993), Seber and Thompson
(1994)) in order to estimate the abundance vector when a
clustered pattern is suspected in a study area. Generally, when
Adaptive Cluster Sampling is adopted with rare and patchy
populations, the resulting sample density estimates have a lower variance
than those from Simple Random Sampling (SRS). Ecologists have
shown interest in this sampling technique as many of the
populations in ecology have a patchy distribution (Andrew and Mapstone
(1987)) and it is very rare that an ecological population is
randomly distributed. Using the estimated abundance vector,
we proceed to test hypotheses about species abundances and
to find the statistical model that best fits the data. Several
simulations were performed in order to identify and evaluate the factors
that influence the gain of adaptive cluster sampling in comparison
to simple random sampling.

The paper is organized as follows: in Section 2 we develop a
procedure to estimate the abundance vector of a biological population
by using adaptive cluster sampling. In Section 3 we introduce
some of the most commonly used species abundance models. In
Section 4, in order to stress the pros and cons of using adaptive
sampling designs, a thorough simulation study is performed. Results
and some final conclusions are given in Section 5.

2 Abundance Vector Estimation 

Measures of the diversity of a biological population (e.g. diversity and/or
dominance indexes) are expressed as a function, say $g(\mathbf{T})$, of the
unknown abundance vector $\mathbf{T} = (T_1, T_2, \dots, T_s)$, where the $j$-th
element of the vector $\mathbf{T}$ represents the number of units belonging to
the $j$-th biological species ($j = 1, 2, \dots, s$). The problem lies in
adopting a sampling design to estimate the abundance vector $\mathbf{T}$. In the
literature several sampling designs have been proposed. However,
when the population of the biological species can be represented
by clustered points in a wide study area, the simple random
sampling estimators are characterized by a large variance. Moreover,
multivariate considerations in conventional sampling designs
raise questions of survey design. Indeed, the conventional advice
that, when each variable has a sporadic distribution concentrated
in a small part of the study region, a different sampling design
is needed for each variable (Cochran (1977) p. 77) may not need
to be adhered to when an adaptive sampling design is used. In fact,
with Adaptive Cluster Sampling, the course of the survey depends
on the observed values, hence it is possible to accommodate in a
single survey the sporadic distribution of several variables of
interest. The problem can be viewed in multivariate terms,
where the abundances are considered as $s$ different variables
of interest $Y_j$ ($j = 1, 2, \dots, s$). In this context, for each area unit
$(u_1, u_2, \dots, u_N)$ labelled by $(1, 2, \dots, N)$, the value of the $j$-th variable
is denoted by $y_{ij}$ ($i = 1, 2, \dots, N$) and the population matrix may
be denoted by $\mathbf{Y}$. In the fixed population view, $\mathbf{Y}$ is considered
a fixed set of unknown constants and a sample of size $n$, drawn
from the population units, represents a sequence of $n$ labels. In
adaptive sampling, each unit is defined to have a neighborhood of
other units associated with it. For example, suppose the
population is composed of two subpopulations:

$$P_M = \{y : y_i \in C,\ i = 1, 2, \dots, N\}$$

and

$$P_{N-M} = \{y : y_i \notin C,\ i = 1, 2, \dots, N\}$$

with $C \subset \mathbb{R}$ a previously fixed condition. In the multivariate case,
the condition is specified by a region $C$ in $s$-dimensional space.
So, if a member of $P_M$ is selected to be in the initial sample,
then its neighborhood is added to the sample and if any member
of the neighborhood is also in $P_M$, then its neighborhood is also
selected. Adaptive sampling stops when no more neighbors are
members of $P_M$ and the final sample consists of the initial sample
and all adaptively sampled units. At the end of the procedure, for
each initial unit we have a corresponding set of units of $P_M$ which
we call a "network". Note that, by symmetry of the definition of
neighborhood, if any unit in a network is selected to be in the
initial sample, then every member of the network is adaptively
sampled and is selected as part of the final sample. Now
suppose an adaptive sampling design is performed. Let us denote by
$g = (i_1, \dots, i_n)$ a collection of indexes drawn from the $(1, 2, \dots, L)$
networks, $g \in G$, where $G$ is the set of all possible samples of
networks. Then, a design-unbiased estimator of the abundance $T_j$
is

$$\hat{T}_j = \frac{N}{n}\sum_{i \in g} w_{ij} = \frac{N}{n}\sum_{l=1}^{L} Z_l\, w_{lj} \qquad (1)$$

where

$$w_{lj} = \frac{1}{M_l}\sum_{r \in A_l} y_{rj}$$

is the average of the $l$-th network ($l = 1, 2, \dots, L$) and
$Z_l \sim \mathrm{Bin}(n, M_l/N)$.

The abundance estimator $\hat{T}_j$ ($j = 1, 2, \dots, s$) is an
unbiased and consistent estimator of its population counterpart
(Thompson, 1996). The estimator of the population total is
$\hat{T} = \mathbf{1}^t \hat{\mathbf{T}}$. In addition, a straightforward application of the Central
Limit Theorem ensures that the abundance vector estimator
$\hat{\mathbf{T}} = (\hat{T}_1, \hat{T}_2, \dots, \hat{T}_s)^t$ is asymptotically normally distributed.

We point out that the estimator (1) does not depend on the
condition $C$, as the condition simply determines the network structure
before any sampling takes place.
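
The network-average estimator can be illustrated on a small grid. The sketch below is a toy under stated assumptions, not the authors' implementation: units are grid cells holding a tuple of $s$ species counts, the neighbourhood is the four adjacent cells, the condition is supplied by the caller, and the estimator is the average of network means scaled by $N/n$. The function names are hypothetical.

```python
def networks(grid, condition):
    """Label networks: connected groups (four-cell neighbourhood) of units that
    meet the condition; every other unit forms a network of size one."""
    rows, cols = len(grid), len(grid[0])
    label = [[None] * cols for _ in range(rows)]
    members = []
    for i in range(rows):
        for j in range(cols):
            if label[i][j] is not None:
                continue
            idx = len(members)
            label[i][j] = idx
            group, stack = [(i, j)], [(i, j)]
            if condition(grid[i][j]):
                while stack:
                    a, b = stack.pop()
                    for c, d in ((a + 1, b), (a - 1, b), (a, b + 1), (a, b - 1)):
                        if (0 <= c < rows and 0 <= d < cols
                                and label[c][d] is None
                                and condition(grid[c][d])):
                            label[c][d] = idx
                            group.append((c, d))
                            stack.append((c, d))
            members.append(group)
    return label, members

def abundance_estimate(grid, initial_units, condition):
    """N/n times the sum, over initial units, of the mean of the network
    each unit belongs to; one estimate per species."""
    label, members = networks(grid, condition)
    n_units = len(grid) * len(grid[0])
    s = len(grid[0][0])
    totals = [0.0] * s
    for i, j in initial_units:
        group = members[label[i][j]]
        for k in range(s):
            totals[k] += sum(grid[a][b][k] for a, b in group) / len(group)
    return [n_units * t / len(initial_units) for t in totals]
```

Averaging `abundance_estimate` over every possible initial sample of size one reproduces the true abundance vector, which is a direct check of design unbiasedness in this toy setting.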

3 Some Species Abundance Distributions 

The species diversity of a biological population has long been a
subject of environmental studies. Detailed reviews of statistical
methods to analyze this problem can be found in Patil and Taillie
(1982) and Gove et al. (1994).

Given a community, i.e. an assemblage of organisms living together in a
particular environment, ecologists attempt to answer the following
question: how do species evolve and relate to one another? The
simplest measure of diversity is to count the number of species:
species richness. This approach does not take species abundance
into account. Many diversity indices have been developed that
incorporate both species richness and the degree of evenness with
which individuals are distributed among species. There is no index
that is excellent across discriminant ability (how capable it is
of detecting subtle differences between sites), sensitivity to sample
size and sensitivity to common or rare species. Therefore, it is not possible
to recommend a single index as superior to all others, and the choice
of the appropriate index depends on what sort of question is
being asked. Moreover, descriptions of whole communities by one
statistic of diversity run the risk of losing much valuable
information. As a matter of fact, a more complete picture is obtained
by examining the species abundance pattern, which is amenable to a more
meaningful analysis. Several models to describe the patterns of a
biological community have been used (Pielou (1975)). Assuming
that the total number of species is known, we consider the models
commonly discussed in the ecological literature, all referring to
how the total niche of a community, represented by a stick, should
be broken into pieces. The geometric model, or dominance model (May
(1975)), is typically adopted under the hypothesis that each
species, one at a time, consumes a fraction, say $k$, of the remaining
resource. In this context the abundance for the $j$-th species is given
by

$$T_j = T k (1 - k)^{j-1}. \qquad (2)$$

In contrast, in the broken-stick model, or equitability model (MacArthur
(1957)), the species simultaneously consume an amount of the
resource, implying random partitioning of resources, thus

$$T_j = \frac{T}{s}\sum_{i=j}^{s}\frac{1}{i}. \qquad (3)$$

Finally, the lognormal model, or diversity model (Preston (1948)),
is more common for communities rich in species. One ecological
explanation for the lognormal distribution refers to the broken
stick model, but with the stick broken sequentially rather than
instantaneously. The model can be specified by

$$S(R) = S_0 \exp(-a^2 R^2) \qquad (4)$$




250 Gattone et al. 



where $a = 1/(\sigma\sqrt{2})$, $R = \log_2(I_i/I_0)$, $\sigma$ is the standard deviation of the
distribution and $S_0$ is the number of species in the modal
interval. Following Preston’s convention of expressing abundance, $I_i$
is the species abundance in the $i$-th interval and $I_0$ is the species
abundance in the modal interval. $S(R)$ thus represents the
theoretical number of species in the $R$-th interval using the lognormal
model. Note that abundance intervals are expressed on a
logarithmic scale.

We point out that the distribution of resources is most
equitable (high diversity) in the broken stick model, less equitable
(medium diversity) in the lognormal model and least equitable
(low diversity) in the geometric series. These models span
the whole range of possible species-abundance patterns. However,
unless there are many species in the samples, it is often very
difficult to distinguish a broken stick from a lognormal model
(Wilson (1993)).
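
To make the contrast between the models tangible, the following sketch computes expected abundances by rank. It is illustrative only: the geometric function follows equation (2), the broken-stick function uses MacArthur's usual expected-abundance form $(T/s)\sum_{i=j}^{s} 1/i$ (assumed here), the lognormal function evaluates Preston's curve with $a = 1/(\sigma\sqrt{2})$ (also an assumption), and the function names are hypothetical.

```python
import math

def geometric_series(total, k, s):
    """Expected abundance of the j-th ranked species, j = 1..s (dominance model)."""
    return [total * k * (1 - k) ** (j - 1) for j in range(1, s + 1)]

def broken_stick(total, s):
    """Expected abundance of the j-th ranked species under MacArthur's model."""
    return [total / s * sum(1.0 / i for i in range(j, s + 1))
            for j in range(1, s + 1)]

def preston_lognormal(R, s0, sigma):
    """Theoretical number of species in the R-th octave from the modal one."""
    a = 1.0 / (sigma * math.sqrt(2.0))
    return s0 * math.exp(-(a ** 2) * R ** 2)

if __name__ == "__main__":
    # Settings matching the simulation design: T = 500, k = 0.6, s = 7.
    geo = geometric_series(500, 0.6, 7)
    stick = broken_stick(500, 7)
    for rank, (g, b) in enumerate(zip(geo, stick), start=1):
        print(f"rank {rank}: geometric {g:8.2f}   broken stick {b:8.2f}")
```

With $k = 0.6$ the most abundant species takes 60% of $T$ under the geometric model, against roughly 37% under the broken stick with seven species, which is the dominance versus equitability contrast described above.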

4 Method: Simulation Data and Model 
Fitting 

Simulation experiments were performed in order to identify and
evaluate factors that influence the relative gain of MACS over SRS.
Monte Carlo simulation techniques were used to generate 36
populations by the "modified Thomas process" of Diggle et al. (1976).
The study area is represented by a study site (70×70 units) in
which parents are randomly distributed, each parent then being
surrounded by a clump of offspring. A parent coordinate pair
$(a, b)$, $0 < a, b < 1$, is selected by choosing two random values
from the uniform distribution. For the $j$-th species, the number
of offspring for a parent is generated as a Poisson variate with
mean $\lambda_j$. The average clump size for the $j$-th species is $\lambda_j + 1$, i.e.
the number of offspring plus the parent. Finally, the number of
parents for the $j$-th species is calculated as $p_j = T_j/(\lambda_j + 1)$.

The offspring are spatially distributed about their parents at a
radial distance selected from an Exponential distribution with
mean $r_j$ and at a random angle uniformly distributed between
0° and 360°. The sampling area was defined as the central area
of 50×50 units within the site, to allow for edge effects from the
modified Thomas process. The model was used to generate 36
population models using a 3×2×3×2 factorial design with the
factor levels being:

• the total of the population, $T = \sum_{j=1}^{s} T_j$, was 100, 500 and
2000.

• the number of parents, $p_j$, was 5 and 15.

• the mean distance of offspring from each parent, $r_j$, was 1, 2
and 4.

• the species abundance distributions were the Geometric model
with $k = 0.6$ and the Broken stick model, both with $s$, the
number of species, equal to 7.
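
The generating mechanism described above can be sketched for a single species as follows. This is a simplified illustration under several assumptions: parents are placed uniformly over the whole 70×70 site, offspring counts come from a basic Poisson sampler, and the central 50×50 window is divided into unit cells for counting; none of the function names come from the paper.

```python
import math
import random

def poisson(mean, rng):
    """Knuth's multiplication sampler; adequate for moderate means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def thomas_species(n_parents, lam, mean_radius, side, rng):
    """Parents uniform on the site; each gets Poisson(lam) offspring placed at
    an Exponential radial distance and a uniform random angle."""
    points = []
    for _ in range(n_parents):
        px, py = rng.uniform(0.0, side), rng.uniform(0.0, side)
        points.append((px, py))
        for _ in range(poisson(lam, rng)):
            dist = rng.expovariate(1.0 / mean_radius)   # mean = mean_radius
            angle = rng.uniform(0.0, 2.0 * math.pi)
            points.append((px + dist * math.cos(angle),
                           py + dist * math.sin(angle)))
    return points

def window_counts(points, lo, hi):
    """Counts per unit cell inside the central square [lo, hi) x [lo, hi)."""
    size = int(hi - lo)
    grid = [[0] * size for _ in range(size)]
    for x, y in points:
        if lo <= x < hi and lo <= y < hi:
            row = min(int(y - lo), size - 1)
            col = min(int(x - lo), size - 1)
            grid[row][col] += 1
    return grid
```

For a species with abundance $T_j = 300$ and $p_j = 5$ parents, the relation $p_j = T_j/(\lambda_j + 1)$ gives $\lambda_j = 59$, so `thomas_species(5, 59.0, 2.0, 70.0, random.Random(1))` generates one such pattern with mean offspring distance 2.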

The first three factors cover the range of natural spatial patterns 
that may be encountered in most ecological communities, from 
rare and highly clustered population to common and randomly 
distributed population. The populations were each sampled 2000 
times using MACS with initial sample n = 15 and estimates T macs 
of the abundance vector T were obtained using the estimator (1). 
The aggregative condition was defined by the region 

$$ C = \{ y_{ij} : y_{ij} > 0, \ \forall i, j \}. $$

In order to evaluate the gain obtained using the adaptive strategy
with respect to simple random sampling, conventional estimates
$\hat{T}_{srs}$ of T were evaluated using SRS with sample size equal to
the expected sample size $E(\nu)$ under the adaptive design (Thompson
(1996) p. 111). Finally, we proceeded to fit the estimated
abundance vectors to the theoretical abundance distributions.
In particular, it is worthwhile to note that ecologists, in practice,
do not judge the appropriateness of models by absolute fit, but
rather by whether they give a better fit than other models. Therefore,
once we have the estimates $\hat{T}_{macs}$ and $\hat{T}_{srs}$, Dominance
and Equitability models were fitted to these abundances using
the method of Wilson (1991), i.e. the sum of squared Euclidean
distances between model and observation on a logarithmic scale:

$$ D_{mod} = \sum_{j=1}^{s} \left[ \log T_{obs}(j) - \log T_{mod}(j) \right]^2 \qquad (5) $$
where $T_{obs}(j)$ and $T_{mod}(j)$ denote observed and expected abundance
of rank j, respectively. The model giving the best fit to each set
of observed abundances was then determined as the model with
the lowest value of D. Out of the 2000 samples, the relative gain
of MACS relative to SRS is computed by:



$$ g = \frac{I(D_{true} < D_{false} \mid MACS) - I(D_{true} < D_{false} \mid SRS)}{I(D_{true} < D_{false} \mid SRS)} \qquad (6) $$

where $I(D_{true} < D_{false} \mid MACS)$ is the number of times the true
model is chosen using MACS. Note that, for the purposes of this
study, s is supposed to be known and empty samples (those that
fall outside any clumps during the sampling) were discarded.
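
As a rough illustration of the model-selection step, the sketch below computes expected rank abundances under the two models, the log-scale distance of equation (5) and the relative gain of equation (6). It rests on several assumptions that are not taken from the paper: the standard textbook forms of the geometric series and of MacArthur's broken stick are used, the geometric parameter k is treated as known rather than fitted, all observed abundances are assumed strictly positive, and every function name is hypothetical.

```python
import math


def geometric_abundances(total, s, k):
    """Expected rank abundances under the geometric series, scaled to sum to `total`."""
    norm = 1.0 - (1.0 - k) ** s
    return [total * k * (1.0 - k) ** (j - 1) / norm for j in range(1, s + 1)]


def broken_stick_abundances(total, s):
    """Expected rank abundances under the broken stick model."""
    return [total / s * sum(1.0 / x for x in range(j, s + 1)) for j in range(1, s + 1)]


def log_distance(observed, expected):
    """Equation (5): squared distance between ranked abundances on a log scale."""
    ranked = sorted(observed, reverse=True)
    return sum((math.log(o) - math.log(e)) ** 2 for o, e in zip(ranked, expected))


def best_model(observed, s, k=0.6):
    """Name of the model with the smallest distance D."""
    total = sum(observed)
    fits = {
        "geometric": log_distance(observed, geometric_abundances(total, s, k)),
        "broken stick": log_distance(observed, broken_stick_abundances(total, s)),
    }
    return min(fits, key=fits.get)


def relative_gain(hits_macs, hits_srs):
    """Equation (6): gain of MACS over SRS in the number of correct selections."""
    return (hits_macs - hits_srs) / hits_srs
```

For example, if the true model were selected 1350 times under MACS and 1000 times under SRS (made-up counts), `relative_gain(1350, 1000)` would give a gain of 0.35.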



5 Results and Discussion 

Results of the simulations are shown in Tables 1 and 2
for the Geometric and the Broken stick populations, respectively.
The tables report the expected sample size of MACS, which
is the sample size used for the SRS, and the relative gain g as
defined in (6). Performance of adaptive cluster sampling relative
to simple random sampling was highest:

• in order to detect the Geometric model (Dominance of one
species over the others) instead of the Broken stick (Equitability);

• when the average final sample size was close to the initial sample.

For the less patchy populations with many large clusters, i.e. r = 4,
T = 2000 and p = 5, 15, the adaptive selection process resulted
in a large percentage of the study area being sampled. On the
other hand, for the most patchy populations, i.e. r = 1, 2, p = 5
and T = 100, 500, few quadrats were selected adaptively and the
average final sample size was close to the initial sample. The
Geometric model has relatively more rare species than the Broken
stick, so that the sampling fraction is smaller, indicating that
adaptive sampling may be more favorable in low density populations.
Furthermore, given the way we have chosen the aggregative condition,
for the Broken stick populations the output is strongly
dependent on the ability to detect clusters of different species,
whereas if the sampling effort were determined by only one
species, MACS would perform very poorly in detecting equitability.
Table 1. Efficiency g of MACS with an initial sample of n = 15 relative to SRS
with sample size E(ν) for the Geometric model. The population parameters were:
T, the population total; p, the number of clusters; r, the average distance of an
individual from a cluster center.

                   r = 1            r = 2            r = 4
    T      p    E(ν)      g      E(ν)      g      E(ν)      g
  100      5     18     0.35      18     0.33      18     0.11
          15     18     0.09      18     0.15      18     0.21
  500      5     37     0.40      60     0.25      55     0.17
          15     36     0.23      42     0.06      46     0
 2000      5     92     0.25     195    -0.05     316    -0.19
          15    138    -0.01     182    -0.08     395    -0.14

Table 2. Efficiency g of MACS with an initial sample of n = 15 relative to SRS
with sample size E(ν) for the Broken stick model. The population parameters
were: T, the population total; p, the number of clusters; r, the average distance of
an individual from a cluster center.

                   r = 1            r = 2            r = 4
    T      p    E(ν)      g      E(ν)      g      E(ν)      g
  100      5     18     0.14      18     0.03      18     0.11
          15     18     0.15      18     0.03      18     0.21
  500      5     37     0         39    -0.01      41     0.17
          15     40    -0.04      42    -0.07      44     0
 2000      5    112    -0.07     225    -0.15     319    -0.19
          15    175    -0.14     292    -0.11     343    -0.10

From a planning perspective the uncertainty of the final sampling
fraction is a disadvantage of adaptive sampling designs. A
solution might be restricted adaptive sampling (Brown and
Manly (1998)), where a limit is placed on the sample size prior
to sampling. Further improvements can be made by adjusting the
aggregative condition in order to consider the number of species
detected (e.g. a weighted sum of counts for several species). Finally,
as pointed out by Thompson (1992) pp. 275-276, for adaptive
cluster sampling, comparisons should be made on the basis of
cost because it is often less expensive to sample within a cluster
than to sample a new cluster. Therefore, comparisons based on
cost equations would tend to favor MACS more than comparisons
based only on expected sample size, as was done in this paper.
Abstract. In this paper we consider the problem of parametric estimation of a
linear model corrupted by measurement error. In order to take into account the
biasing effects caused by the presence of an external source of error, we propose an
adjustment to the least squares estimator. The statistical properties of the adjusted
estimator are then experimentally verified with respect to two models that are
widely used in the image analysis literature.



1 Introduction 

In this paper we consider the problem of signal extraction. The
object is to infer the value of a signal from a record which is affected
by superimposed noise. Before embarking on the main topics, we
need to establish some basic results concerning minimum-mean-square-error
prediction. The criterion which is commonly used in
judging the performance of an estimator or predictor $\hat{Y}$ of a
random vector Y is its mean-square error, defined by $E(Y - \hat{Y})^2$. Let
$\{X_j\}$ be a sequence of random variables and use a linear function
of the values in the information set $I_t = \{X_{j/t} ; j \in \delta_t\}$ to predict
the value of $\{Y_t\}$. Then the prediction may be denoted by the
linear projection of $\{Y_t\}$ on $X_{j/t}$:



$$ \hat{Y}_t = X_{j/t} \beta \qquad (1) $$

The principle of orthogonality implies that the minimum-mean-square-error
estimate of $Y_t$ is obtained by finding the coefficients
$\{\beta_j\}$ which satisfy the conditions

$$ 0 = E\{X_j^T (Y - \hat{Y})\} = E(X_j^T Y) - E(X_j^T X_j) \beta \qquad (2) $$

where X is the whole information set $I = \{X_j ; j \in \delta_t, t = 1, 2, \ldots, n\}$.
The precise nature of the solution depends upon the extent of the
information set, which is determined by the index set $\delta_t$; in practice,
the set will comprise only a finite (and small) number of
elements.

The paper is organised as follows: theoretical results are
established in Section 2; though signal extraction is a very general
issue, we restrict our attention to the field of spatial analysis. In
Subsections 2.1 and 2.2, we shall deal with two different spatial
models, and propose two least squares estimators "adjusted" to
take into account the presence of an external source of error. Finally,
Section 3 closes the paper with some experimental results
both from simulations and from real observed data.

2 Signal Extraction from a Finite Sample 

Consider the case of a signal sequence $Y_t$ and imagine that the
observations $Z_t$ are contaminated by a white-noise error $\eta_t$, which
is assumed to be statistically independent of the $Y_t$. Then, for a
set of n observations, we should have

$$ Z = Y + \eta \qquad (3) $$

If the sequence $Y_t$ underlying the observations is serially correlated,
then there will be some scope for deriving better estimates
of its values than those which are provided by the sequence $Z_t$
of the observations. Note that this signal extraction problem may
concern classical time-series analysis as well as spatial and image
analysis. In this paper, $Y_{ij}$ represents the grey level at pixel (i, j)
of an N x M (= n) image and we consider parameter estimation
for two particular Markov Random Fields observed with additive
independent identically distributed noise.

2.1 Gaussian Markov Random Fields 

Gaussian Markov Random Fields (GMRFs) are very commonly
used for modelling textures in image analysis and more generally
in the field of spatial data analysis; for an introduction see
Besag (1974). We consider parameter estimation in the presence
of independent identically distributed (i.i.d.) noise. For particular
boundary conditions, the implementation of optimal parameter
estimation techniques such as maximum likelihood (ML) estimation
can be computationally expensive and problematic. Maximum
pseudo-likelihood (MPL) estimation (Besag, 1977) provides
a quick and often reasonably efficient method of parameter
estimation. Here, we consider adjusted MPL estimation of GMRF
parameters in the presence of noise, making an adjustment which
takes into account the noise. The grey levels are distributed according
to a zero mean GMRF if the distribution of Y is multivariate
normal with conditional means and variances



$$ E(y_{ij} \mid y_{rs} : (r, s) \in \delta_{ij}) = \sum_{(r,s) \in \delta_{ij}} \beta_{rs} y_{rs}, $$

$$ Var(y_{ij} \mid y_{rs} : (r, s) \in \delta_{ij}) = \tau^2 ; $$

where $\delta_{ij}$ is the set of neighbours of pixel (i, j) (not including
(i, j) itself). For a homogeneous (spatially invariant) model,
$\beta_{rs} = \beta_{-r,-s}$ and $\beta_{00} = 0$. The primary objective is to estimate
the parameters of the (q + 1)-vector $\theta$ containing the distinct $\beta_{rs}$
parameters and $\tau^2$ in the presence of zero mean independent
identically distributed noise with variance $\sigma_\eta^2$. A very quick
estimation method is pseudo-likelihood estimation (Besag, 1977).
The procedure accepts a trade-off between efficiency and simplicity.
The pseudo-likelihood is the product of the conditional density
functions of each pixel given the rest and it is well known that, in
the Gaussian case, MPL estimates can be obtained from the solution
of equation (2). So, given a realization z of a noisy GMRF, we
minimize the following function



$$ \sum_{i=1}^{N} \sum_{j=1}^{M} (z_{ij} - x_{ij} \beta)^2 \qquad (4) $$

where $\beta$ is the (q x 1) spatial parameter vector and
$\{x_{rs/ij} ; (r, s) \in \delta_{ij}\}$, $i = 1, \ldots, N$, $j = 1, \ldots, M$, which we shall
simply denote by $x_{ij}$, is the (1 x q)-vector of observations on the
neighbourhood of $z_{ij}$. For example, considering a second order GMRF,
we have $\beta = (\beta_{0,1}, \beta_{1,0}, \beta_{1,1}, \beta_{1,-1})^T$ and $x_{ij}$ is the corresponding
(1 x 4) vector of neighbouring observations. Returning to the general
order GMRF, after differentiating (4) with respect to $\beta$, we exploit
the statistical properties of the measurement noise $\eta_{ij}$ to give the
Adjusted Least Square Estimation (ALSE):

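$$ \hat{\beta} = \left( \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^T x_{ij} - 2 n \sigma_\eta^2 I \right)^{-1} \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^T z_{ij} \qquad (5) $$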

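where I is the (q x q) identity matrix. The adjusted conditional
variance estimate is then given by

$$ \hat{\tau}^2 = \frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{M} (z_{ij} - x_{ij} \hat{\beta})^2 - \sigma_\eta^2 (1 + 2 \hat{\beta}^T \hat{\beta}) \qquad (6) $$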

Note that $\sigma_\eta^2 (1 + 2 \hat{\beta}^T \hat{\beta})$ is the adjustment which takes care of the
noise. The measurement noise variance $\sigma_\eta^2$ is an additional parameter
which is to be estimated before applying the ALSE procedure.
Several methods have been proposed in the literature; for a review see
Olsen (1993). In (5) it is assumed that the parameters lie inside the
valid parameter region. However, as in many real textures, it happens
that the high correlation in the field leads the parameters to
lie on the boundary, defining an Intrinsic GMRF. The difficulties
in parameter estimation may be overcome by considering constrained
least squares estimation: following Künsch (1987), who
considered the noise-free case, we propose a Constrained ALSE,
which assumes the following expression:

$$ \tilde{\beta} = \hat{\beta} - K^{-1} C^T (C K^{-1} C^T)^{-1} (C \hat{\beta} - c) \qquad (7) $$

where $K = \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^T x_{ij} - 2 n \sigma_\eta^2 I$, and C is a (p x q)-matrix
which defines the linear system of the parameters which must be
solved according to the constraints in the (p x 1) vector c. As a
consequence, the constrained conditional variance estimate, $\tilde{\tau}^2$, is
identical to expression (6), with $\tilde{\beta}$ replacing $\hat{\beta}$.
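
To make the adjustment concrete, the following Python sketch applies the ALSE idea to a first order GMRF, where the normal equations are only 2 x 2 and can be solved in closed form. It is an illustration under assumptions rather than the authors' implementation: the function name is hypothetical, only interior pixels are used instead of explicit boundary conditions, n is taken as the number of interior pixels, and the first subscript of $\beta$ is read as the row offset (so that $\beta_{1,0}$ is the vertical interaction).

```python
def alse_first_order(z, noise_var):
    """Adjusted least squares for a first-order GMRF observed with i.i.d. noise.

    z: list of rows of zero-mean grey levels; only interior pixels are used
    (an assumption of this sketch). Returns (beta_01, beta_10, tau2).
    """
    rows, cols = len(z), len(z[0])
    a11 = a12 = a22 = b1 = b2 = 0.0
    terms = []
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            x1 = z[i][j - 1] + z[i][j + 1]   # horizontal pair, weight beta_01
            x2 = z[i - 1][j] + z[i + 1][j]   # vertical pair, weight beta_10
            a11 += x1 * x1
            a12 += x1 * x2
            a22 += x2 * x2
            b1 += x1 * z[i][j]
            b2 += x2 * z[i][j]
            terms.append((x1, x2, z[i][j]))
    n = len(terms)
    # Noise adjustment: subtract 2 * n * sigma^2 from the diagonal of X'X.
    a11 -= 2.0 * n * noise_var
    a22 -= 2.0 * n * noise_var
    # Closed-form solution of the adjusted 2 x 2 normal equations.
    det = a11 * a22 - a12 * a12
    beta_01 = (a22 * b1 - a12 * b2) / det
    beta_10 = (a11 * b2 - a12 * b1) / det
    # Adjusted conditional variance: mean squared residual minus the noise term.
    rss = sum((zc - beta_01 * x1 - beta_10 * x2) ** 2 for x1, x2, zc in terms)
    tau2 = rss / n - noise_var * (1.0 + 2.0 * (beta_01 ** 2 + beta_10 ** 2))
    return beta_01, beta_10, tau2
```

With `noise_var = 0` the function reduces to ordinary least squares, which gives a convenient check of the adjustment term.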
2.2 Markov Mesh Fields

Markov Mesh (MM) fields (Kanal, 1980) are a subclass of Markov
Random Fields (MRFs) that impose quarter plane causality constraints
on the neighbourhoods. As an MRF, a Markov Mesh field
may be defined through its conditional probability density function
but with a non-symmetric halfplane (NSHP) neighbourhood
$\delta_{ij} = \{(r, s) : (r < i) \cup (r = i, s < j)\}$. More generally, one could
select the neighbour set as a subset of the quarter plane, since there
is no unique definition of causality in two dimensions. A Gaussian
Markov Mesh may be defined directly as a one-sided white noise
driven Autoregressive field. Here we give an example of the third
order NSQP (Non-symmetric Quarterplane) MM that has appeared
frequently in image processing work. This field is given by



$$ Y_{ij} - (\gamma_{0,-1} Y_{i,j-1} + \gamma_{-1,0} Y_{i-1,j} + \gamma_{-1,-1} Y_{i-1,j-1}) = \varepsilon_{ij} \qquad (8) $$

with coefficients $\gamma_{0,-1}$, $\gamma_{-1,0}$ and $\gamma_{-1,-1}$ representing, respectively,
the interaction with the horizontal, vertical and left diagonal
neighbours, see Figure 1. The driving noise input $\varepsilon_{ij}$ is an
independent, zero mean, Gaussian variable with variance $\sigma_\varepsilon^2$ and statistically
independent of . 
Fig. 1. Coefficients set for MM in (8).
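
Because the field in (8) is causal, it can be simulated by a single raster scan. The sketch below is one possible way to do this in Python and is not taken from the paper: the function names are hypothetical, values outside the lattice are assumed to be zero, and the parameter values in the usage lines are those quoted for the simulation study in Section 3.1.

```python
import random


def simulate_markov_mesh(rows, cols, g_h, g_v, g_d, sigma_eps, rng):
    """Simulate Y_ij = g_h*Y_{i,j-1} + g_v*Y_{i-1,j} + g_d*Y_{i-1,j-1} + eps_ij.

    g_h, g_v, g_d play the roles of the horizontal, vertical and left diagonal
    coefficients; values outside the lattice are set to zero (an assumption).
    """
    # Row 0 and column 0 form the zero border.
    y = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            eps = rng.gauss(0.0, sigma_eps)
            y[i][j] = (g_h * y[i][j - 1] + g_v * y[i - 1][j]
                       + g_d * y[i - 1][j - 1] + eps)
    return [row[1:] for row in y[1:]]


def add_noise(field, noise_var, rng):
    """Observed field z = y + eta with i.i.d. Gaussian noise of variance noise_var."""
    sd = noise_var ** 0.5
    return [[value + rng.gauss(0.0, sd) for value in row] for row in field]


rng = random.Random(42)
y = simulate_markov_mesh(64, 64, g_h=0.1, g_v=0.5, g_d=-0.1, sigma_eps=10.0, rng=rng)
z = add_noise(y, 12.19, rng)
```

The pair (y, z) then corresponds to the noise-free field and its noisy observation in (3).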



Also in this case, the primary objective is to estimate the
parameters in the presence of a noisy field. Considering model (8)
and given a realization z of a noisy MM, it becomes natural to
apply the ALSE as for GMRFs. The procedure, based on the new
information set and Dirichlet boundary conditions (Balram and
Moura, 1993), gives

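$$ \hat{\gamma} = \left( \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^T x_{ij} - n \sigma_\eta^2 I \right)^{-1} \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij}^T z_{ij} \qquad (9) $$

with an adjusted variance estimate

$$ \hat{\sigma}_\varepsilon^2 = \frac{1}{n} \sum_{i=1}^{N} \sum_{j=1}^{M} (z_{ij} - x_{ij} \hat{\gamma})^2 - \sigma_\eta^2 (1 + \hat{\gamma}^T \hat{\gamma}) \qquad (10) $$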

3 Experimental Results 

In this section we present parametric estimation results obtained
by means of the Adjusted Least Square Estimator. Subsection
3.1 shows two simulation studies to assess the performance of the
method when applied to the models presented above. Finally, in
Subsection 3.2 two real data sets will be taken into account.

3.1 Simulation Results 

Table 1 contains the results from parametric estimation on 100
simulated samples (lattice size 64x64) of a second order homogeneous
GMRF, with parameters: $\beta_{0,1} = 0.05$, $\beta_{1,0} = 0.2$, $\beta_{1,1} = -0.15$,
$\beta_{1,-1} = 0.05$, $\tau^2 = 100$. Gaussian zero mean i.i.d. noise
was added, with variance $\sigma_\eta^2 = 11.62$, to produce an approximate
SNR (Signal to Noise Ratio) of 10 dB. We also generated the
same number of samples, on equal lattice size, for the MM model
in (8). Results are reported in Table 2. The real parameters were
fixed to: $\gamma_{0,-1} = 0.1$; $\gamma_{-1,0} = 0.5$; $\gamma_{-1,-1} = -0.1$; $\sigma_\varepsilon^2 = 100$, with
added Gaussian zero mean noise of variance $\sigma_\eta^2 = 12.19$. In each
of the experimental trials, we fixed the added noise variance to
its real value.
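
The link between the noise variance and the quoted SNR can be checked with a short calculation. The sketch below assumes the usual definition of SNR in decibels as ten times the base-10 logarithm of the ratio of signal variance to noise variance; the paper does not spell out its definition, and the signal variance of 116.2 used in the example is only the value implied by a noise variance of 11.62 at 10 dB, not a figure reported by the authors.

```python
import math


def snr_db(signal_var, noise_var):
    """Signal-to-noise ratio in decibels (assumed definition: 10 * log10 of the variance ratio)."""
    return 10.0 * math.log10(signal_var / noise_var)


def noise_var_for_snr(signal_var, target_db):
    """Noise variance needed to reach a target SNR (in dB) for a given signal variance."""
    return signal_var / 10.0 ** (target_db / 10.0)


implied_signal_var = 116.2   # hypothetical: variance implied by 11.62 at 10 dB
snr = snr_db(implied_signal_var, 11.62)
```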

As can be noted, ALSE appears to provide good results in
both cases. The only point to note is a small underestimation
of the driving noise variance for the MM model.
Table 1. Parametric estimation for second order GMRF.

GMRF Parameters    $\beta_{0,1}$   $\beta_{1,0}$   $\beta_{1,1}$   $\beta_{1,-1}$   $\tau^2$
Mean                0.048     0.200    -0.149     0.049    100.028
Standard Error      0.015     0.013     0.015     0.015      2.998



Table 2. Parametric estimation for MM (8).

MM Parameters      $\gamma_{0,-1}$   $\gamma_{-1,-1}$   $\gamma_{-1,0}$   $\sigma_\varepsilon^2$
Mean                0.103    -0.100     0.496     96.130
Standard Error      0.017     0.019     0.014      2.987



3.2 Image Analysis Application 

In this section two real images are considered to be realisations of
noisy GMRFs. This assumption allows for the application of the
adjusted estimator presented in Section 2.

The first image is a noisy Oak texture of size (128x128) that,
in the framework of non-causal GMRFs (Section 2.1), is employed
to compare ALSE with the corresponding Adjusted
Maximum Likelihood Estimator (AMLE) presented in Dryden et
al. (2001). In this case, the objective is to observe how "far" ALSE
is from the best adjusted estimator, AMLE. The noise-free image
and the corresponding estimated parameters are presented in Figure 2
and Table 3 respectively.



Table 3. Parametric estimation for the Noise-free Oak texture.

First Order GMRF         $\beta_{0,1}$   $\beta_{1,0}$   $\tau^2$
Ordinary Least Square    -0.008     0.476     32.79
Maximum Likelihood        0.014     0.484     28.18



The well defined vertical structure of the texture suggested
the adoption of a first order GMRF. As can be seen
from Table 3, the ordinary least squares estimates are very close
to the maximum likelihood ones, highlighting, as expected,
a strong vertical interaction parameter $\beta_{1,0}$. The noisy texture has
been obtained by adding zero mean Gaussian noise with variance
$\sigma_\eta^2 = 50$, giving a log signal-to-noise ratio (LSNR; Huang and
Cressie, 2000) equal to 5. The results of the estimation procedure
are shown in Table 4.



Table 4. Parametric estimation for the Noisy Oak texture.

First Order GMRF    $\beta_{0,1}$   $\beta_{1,0}$   $\tau^2$
ALSE                -0.018     0.481     31.73
AMLE                 0.013     0.485     27.50



Considering the quite strong log signal-to-noise ratio, ALSE
and AMLE perform very well, and in particular they continue to
provide very similar results. Furthermore, taking into account that
the computational complexity is O(n) for ALSE and O(n^2) for
AMLE, we conclude that, particularly for large images, ALSE can
be applied with a very quick algorithm at a low cost in precision.
For a deeper analysis of the statistical properties of ALSE and
AMLE the reader may refer to Dryden et al. (2001).

Due to its good features, ALSE has been integrated in a full
image reconstruction procedure. In this case the (128x128) Lenna
image (a classical benchmark) has been considered to compare a
first order GMRF with a third order NSQP Gaussian Markov
Mesh model. The noise-free and noisy images are shown in
Figures 3(a) and 3(b).



Fig. 3. (a) Noise-free and (b) noisy Lenna



The noisy version is characterised by added Gaussian noise
with variance $\sigma_\eta^2 = 850$, producing an initial LSNR of about 5.5. Table 5
shows the results of the estimation procedure for the two models.



Table 5. Parametric estimation of noisy Lenna. Left: first order GMRF; right:
third order NSQP GMM.

142.83

$\gamma_{-1,-1}$   -0.487



The estimated parameters have then been used as input values
of the state space model defined in Moura and Balram (1992).
In this context, the Kalman filter proves to be a very efficient
filtering and smoothing algorithm. The reconstructed images are
presented in Figure 4.

The effectiveness of the reconstruction procedure seems evident
for both models. On the other hand, it must be pointed out
that the causality of the GMM involves an artificial directionality
which is not present in the GMRF. This consideration, based on
a purely visual analysis, is confirmed by a higher final LSNR for the
GMRF, which is equal to 9.32 versus 7.57 for the GMM.

Fig. 4. Reconstructed Lenna image: (a) GMRF (b) Gaussian MM
Abstract. The application of the dynamic regression model to real-time forecasting
of air pollutant concentration raises some problems due to both the high
frequency of sampling and the need for many-step-ahead forecasting. Some flexible
definitions of the system equation are proposed to solve these problems. The
proposed definitions are evaluated by means of an application to the prediction of
nitrogen dioxide concentration in Venezia-Mestre.



1 Introduction 

Air pollutant concentrations and other related variables are usually
collected in real time and with high frequency of sampling
from monitoring networks. One of the principal aims in this context
is to produce real time forecasts of the pollutant concentrations,
in order to evaluate the probability of crossing warning
or high risk thresholds. A statistical model cannot represent exactly
the underlying chemical-physical process. Therefore, flexible
models or models with varying parameters can be defined. Dynamic
regression models (Harrison and Stevens, 1976; West and
Harrison, 1997) are a class of flexible models useful for this kind
of problem. In fact, they make it possible both to model many kinds
of dependence between observed variables (time dependencies, space
dependencies, space-time dependencies) and to include some
singularities like outliers, influential data and change points. In these
models, the system equation describes a law of evolution of the
regression parameters vector over time. In the paper, a more
flexible definition of the system equation is proposed, which
considers the many-step-ahead prediction errors within a sequential
filtering algorithm. The proposed solutions are evaluated considering
an application to the real time prediction of nitrogen dioxide
concentration in Venezia-Mestre. Mestre is the dry-land part of
Venice.

2 Dynamic Regression Models and Real 
Time Prediction 

Consider a time series $y_t$, t = 1, 2, ..., and suppose we are mainly
interested in the prediction, at each time t, of $y_{t+1}, y_{t+2}, \ldots, y_{t+s}$.
For this purpose the values of a set $x_t$ of k regressors are available
at each time t; denote by $\theta_t$ the k-dimensional vector
of regression parameters at time t. The dynamic linear regression
model is a generalisation of the standard static regression model and
it is defined by the following state-space representation:



$$ y_t = x_t' \theta_t + e_t \qquad (1) $$

$$ \theta_t = M_t \theta_{t-1} + u_t \qquad (2) $$

provided a vector of parameters $\theta_0$ at an initial time t = 0 can be
considered. $e_t$ and $u_t$ are random errors, usually assumed sequentially
and mutually uncorrelated.

Equation (1) is the measurement (or observation) equation
and is a usual regression equation, where the parameters vector
$\theta_t$ is not constant over t. Following the terminology of
state-space models, $\theta_t$ is the state vector and $e_t$ is the measurement
error.

Equation (2) is the system (or transition) equation and describes
a law of evolution of the regression parameters vector from
one time to the next, where $u_t$ is the system (or transition)
error. The state-space representation of dynamic regression models
is quite general and makes available general methods for
inference (filtering) and prediction from a wide literature about
dynamic systems. A seminal method, the Kalman filter (Kalman,
1960; Harrison and Stevens, 1976), allows the Bayesian analysis of
the dynamic regression model when $e_t \sim N[0, v_t]$, $u_t \sim N[0, W_t]$,
and $p(\theta_0) = N[\theta_0 | m_0, P_0]$, provided $M_t, v_t, W_t, m_0, P_0$ are known.
Denote by $Y_t^o = (Y_{t-1}^o, y_t^o)$ the information about $y_t$ available at
time t, for t = 1, 2, ..., with $Y_0^o = \emptyset$. Following a Bayesian
interpretation, the Kalman filter is essentially a set of recursive
equations that, considering the posterior distribution
$p(\theta_{t-1} | Y_{t-1}^o) = N[\theta_{t-1} | m_{t-1}, P_{t-1}]$ of the state vector at time
t - 1 and the observed value $y_t^o$, gives the parameters $m_t$ and $P_t$
of the posterior distribution $p(\theta_t | Y_t^o) = N[\theta_t | m_t, P_t]$ of the state
vector at time t and the predictive distribution
$p(y_{t+1}, y_{t+2}, \ldots, y_{t+s} | Y_t^o)$ of future observations of the time series.
Within the recursions of the Kalman filter, the transition equation
defines the prior distribution $p(\theta_t | Y_{t-1}^o)$ from the posterior
distribution $p(\theta_{t-1} | Y_{t-1}^o)$.
Many extensions of the Kalman filter have been proposed within
the Bayesian approach (West and Harrison, 1997). Recently,
sequential Monte Carlo filters have been proposed as interesting
alternatives to the Kalman filter (see Doucet et al., 2001 for a
comprehensive review). Sequential Monte Carlo filters make it
possible to consider, in a quite general way, non-linearity in the
measurement and system equations and non-normality of the errors.
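
The recursion described above can be sketched for a univariate series as follows. This is the textbook Kalman update written in plain Python under the simplifying assumption $M_t = I$; the function name and argument layout are hypothetical and the code is meant only to make the prior, forecast and update steps explicit.

```python
def kalman_step(m, P, x, y, v, W):
    """One update of the dynamic regression model with M_t = I.

    m: posterior mean at t-1 (list of k floats); P: posterior variance (k x k lists);
    x: regressors at time t; y: observed value; v: measurement variance;
    W: system error variance (k x k lists).
    Returns (m_new, P_new, forecast, forecast_var).
    """
    k = len(m)
    # Transition: the prior at time t is N[m, R] with R = P + W.
    R = [[P[i][j] + W[i][j] for j in range(k)] for i in range(k)]
    # One-step-ahead forecast: mean f and variance Q.
    f = sum(x[i] * m[i] for i in range(k))
    Rx = [sum(R[i][j] * x[j] for j in range(k)) for i in range(k)]
    Q = sum(x[i] * Rx[i] for i in range(k)) + v
    # Update with the one-step-ahead forecast error e.
    e = y - f
    gain = [Rx[i] / Q for i in range(k)]
    m_new = [m[i] + gain[i] * e for i in range(k)]
    P_new = [[R[i][j] - gain[i] * gain[j] * Q for j in range(k)] for i in range(k)]
    return m_new, P_new, f, Q
```

Only the previous posterior (m, P) and the current pair (x, y) enter each step, so the full history never has to be stored. With W set to zero the recursion reproduces static regression.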

3 Sequential Filtering and Hyperparameter
Estimation

Some aspects have to be emphasised at this point, if we are interested
in real-time prediction of $y_{t+1}, y_{t+2}, \ldots, y_{t+s}$. The first aspect is
the recursive nature of the Kalman filter. At time t, the distribution
of the regression parameters vector $\theta_t$ depends, given the
posterior distribution $p(\theta_{t-1} | Y_{t-1}^o)$, only on the data observed at
time t and not on the complete sequence $Y_t^o$. This is highly
relevant when data are collected with high frequency, as in air
pollution monitoring networks. In these cases, the dataset quickly
grows and the use of prediction methods which have to consider
at each time t the complete sequence $Y_t^o$ could be difficult. The
second aspect concerns the so-called hyperparameters:
$M_t, v_t, W_t, m_0, P_0$. These parameters are assumed known
in the Kalman filter but they are not known in applications.
The expected value $m_0$ and the variance $P_0$ are the parameters of
the prior distribution of the state vector at t = 0 and they can be
set subjectively by the researcher, given the initial state of
information. However, their importance in the posterior distribution
$p(\theta_t | Y_t^o)$ becomes small as t increases. For the system parameter
$M_t$ some recursive models have been proposed (see Mantovan et
al., 1999, 2000). In the following, we will assume that:

$$ M_t = I, \ \forall t \qquad (3) $$

For the variance $v_t$ of the measurement error, if $y_t$ is a univariate
time series, a standard Bayesian sequential learning analysis is
possible, via a straightforward extension of the Kalman filter. If $y_t$
is a multivariate time series, some approximate Bayesian solutions
have been proposed (Mantovan and Pastore, 1999). As regards the
estimation of the variance $W_t$ of the system error, most of the
literature is devoted to non-sequential procedures. There are two
relevant approaches. One is concerned with maximum likelihood
estimation, obtained by means of numerical methods, like scoring
and the EM algorithm (McWhorter et al., 1976; Shumway and Stoffer,
1982; Watson and Engle, 1983; Ruud, 1991). More recently,
within the Bayesian approach, many authors proposed the use
of MCMC methods (Carlin et al., 1992; Carter and Kohn, 1994;
Frühwirth-Schnatter, 1994). Neither approach, therefore, provides
a solution for the sequential learning of the variance $W_t$.

4 The System Equation as a Tool for the 
Model Flexibility 

A more relevant question concerns the meaning of the system
equation itself. In most control theory and engineering applications,
where a large part of the literature about dynamic systems
has been produced, the system equation is the representation of
a known law of evolution of a system. In the context of dynamic
regression models, the system equation represents a suitable law
of evolution of the regression parameters vector, which is introduced
to allow the models to be more flexible. Many alternatives
are possible, as illustrated by Hastie and Tibshirani (1993) with
the related inferential issues. Assume, as in equation (2), the
evolution of the system parameter to be linear and Markovian. If the
assumption (3) about $M_t$ holds, then the specification of the variance
of the system error becomes relevant. A simple solution has
been proposed by West and Harrison (1997). The variance matrix
$W_t$ is set in order to obtain:

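$$ \mathrm{var}(\theta_t | Y_{t-1}^o) = (1 + \lambda) \cdot \mathrm{var}(\theta_{t-1} | Y_{t-1}^o) \qquad (4) $$

with $\lambda > 0$, or, equivalently:

$$ \mathrm{var}^{-1}(\theta_t | Y_{t-1}^o) = \delta \cdot \mathrm{var}^{-1}(\theta_{t-1} | Y_{t-1}^o) \qquad (5) $$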

with $\delta = (1 + \lambda)^{-1}$. Equations (4) and (5) highlight that the
variance of the prior distribution, at time t, is greater than the
variance of the posterior distribution at time t - 1 and that the ratio
between these two variances may be specified (or controlled) by
a parameter, either $\lambda$ or $\delta$. West and Harrison call $\delta$ the discount
parameter and suggest setting $0.80 < \delta < 1$. This approach makes it
possible to control the effect of the system error variance, but the
parameter has to be set by the researcher. Sequential learning on $\delta$
can be defined via a multiprocess model which introduces a mixture
approximation of the parameters vector distribution at each time t
(Harrison and Stevens, 1976; West and Harrison, 1997). Considering
the application of the dynamic regression model to the prediction of
real data with high frequency of sampling (for instance, air pollution
monitoring data), two other problems arise:

1. Frequency of transition. In the Kalman filter the parameters 
vector has a transition each time that a new value of y t is ob- 
served. In other words, the frequency of sampling of y t is also 
the frequency of the transition of the parameters vector. But 
in many automatic monitoring systems, the frequency of mea- 
surement (or sampling) is defined by measurement instrument 
characteristics or as function of the measures cost and it can 
be considered as independent from the frequency of transition 
of the parameters vector. 

2. Many-step-ahead prediction errors. In the Kalman filter the 
posterior distribution of the parameters vector depends on the 
one-step-ahead prediction error. In many cases, with high fre- 
quency data, the researcher is interested in many-step-ahead 




270 



Mantovan and Pastore 



predictions. For instance, for hourly data, 24-hours-ahead (one- 
day-ahead) predictions could be interesting. If possible, the 
learning rule on the parameters vector should include also 
many-step-ahead prediction errors. 

5 Flexible Definition of the System Equation 

In order to avoid these problems, it is possible to propose a more
flexible definition of the system equation. Depending on the specific
application, some basic ideas can be combined in a specific solution.
Here, some ideas and a specific solution are illustrated,
which allow the definition of sequential filtering procedures more
flexible than the usual solutions, by suitable modifications of the
transition phase. Some basic ideas can be proposed about the
transition frequency of the parameters vector.

Fixed lag transition. In a simple way, the transition could be defined
at a fixed lag, that is every k time points, k > 1 (for instance, for
hourly data, with k = 12 or k = 24) or according to fixed intervals
of time, considering some characteristics of the variable $y_t$. For
instance, in a model for prediction of NO2 hourly concentration
the transition could be fixed at specific hours with respect to the
vehicle traffic. The idea is quite simple, but, as detailed in the
following, it could improve the many-step-ahead prediction.

Prediction-error-driven transition. The transition could happen
only when the model has produced bad predictions, considering
both one-step-ahead and many-step-ahead predictions. Let:

$$ \hat{y}_{t|t-j} = E[y_t | Y_{t-j}^o] $$

and

$$ Q_{t|t-j} = \mathrm{var}[y_t | Y_{t-j}^o] $$

be respectively the expected value and the variance of the predictive
distribution of $y_t$ at time t - j, j > 0. Moreover, let $r_t$ be the
number of observations at time t after the last transition. At time
t, if $j \le r_t$, $\hat{y}_{t|t-j}$ is the last j-step-ahead prediction provided by
the model. A function $h_t(\cdot)$ of the prediction errors $y_t^o - \hat{y}_{t|t-j}$ and of
the variances $Q_{t|t-j}$, $j = 1, \ldots, \min(s, r_t)$, and a threshold r could be
defined, for a suitable value of s, such that at time t the transition
happens only if $h_t(\cdot) > r$.

An example of definition of $h_t(\cdot)$ is the following:

$$h_t(\cdot) = \left[\sum_{j=1}^{\min(s, r_t)} \left(y^o_t - \hat{y}_{t|t-j}\right)^2 \pi_{t,j}\right]^{1/2}, \qquad \text{with } \sum_{j=1}^{\min(s, r_t)} \pi_{t,j} = 1. \tag{6}$$

Other definitions of the function $h_t(\cdot)$ can include an
evaluation of prediction errors based on some specified utility
function. Some alternative solutions for $h_t(\cdot)$ can be found in
Huang and Cressie (1996). This idea explicitly includes the
many-step-ahead prediction errors in the filtering procedure.
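
The decision rule based on equation (6) can be sketched in Python as follows. The function computes the weighted root of the squared j-step-ahead prediction errors and compares it with the threshold. The two weight helpers correspond to a uniform choice and to a choice that puts all the weight on the longest available horizon. All function and argument names are hypothetical.

```python
import numpy as np

def h_t(y_obs, preds, weights):
    """Equation (6). preds[j-1] is the j-step-ahead prediction of y_t,
    for j = 1, ..., min(s, r_t); weights must sum to one."""
    errors = y_obs - np.asarray(preds, dtype=float)
    w = np.asarray(weights, dtype=float)
    if errors.shape != w.shape or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must match preds and sum to 1")
    return float(np.sqrt(np.sum(w * errors ** 2)))

def uniform_weights(n):
    """Same weight for every horizon j."""
    return np.full(n, 1.0 / n)

def last_step_weights(n):
    """All the weight on the longest horizon, j = min(s, r_t)."""
    w = np.zeros(n)
    w[-1] = 1.0
    return w

def transition_needed(y_obs, preds, weights, threshold):
    """The transition happens only if h_t exceeds the threshold."""
    return h_t(y_obs, preds, weights) > threshold
```

For example, with an observation of 50, predictions of 40 and 10 at horizons 1 and 2, and a threshold of 30, the uniform weights do not trigger a transition, while the weights concentrated on the longest horizon do.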

Bayes-factor-driven transition. This approach was proposed as
automatic monitoring by West (1986). Essentially, at each time,
two models (with and without the transition) are considered in the
transition phase and are compared via the predictive Bayes factor,
based on the one-step-ahead predictive distributions under the two
models.

As to the variance of the system error, there are two solutions.



Discounting approach. The effect of the transition on the variance
of the prior distribution $p(\theta_t \mid Y^o_{t-1})$ is controlled via
the discount parameter (West and Harrison, 1997).
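
A small Python sketch may help to fix ideas about discounting. It assumes the usual discount construction, in which the prior variance after a transition is the previous posterior variance divided by a discount factor, and it assumes the mapping delta = 1 / (1 + lambda) between the discount factor and the relative variance increase lambda. This mapping is not stated in the paper; it is chosen because it reproduces the pairs quoted in Section 6 (0.05 with about 0.95, and 0.25 with 0.80). The function names are hypothetical.

```python
import numpy as np

def discount_from_lambda(lam):
    """Assumed mapping: delta = 1 / (1 + lambda)."""
    return 1.0 / (1.0 + lam)

def discounted_prior_variance(C_prev, lam):
    """Prior variance after a transition, R_t = C_{t-1} / delta,
    which equals (1 + lambda) * C_{t-1}.
    Assumes an identity system matrix (random-walk parameters)."""
    return np.asarray(C_prev, dtype=float) / discount_from_lambda(lam)
```

Under these assumptions, a transition with lambda = 0.25 inflates every element of the parameter variance matrix by 25 per cent.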

Optimal discount via the Bayes factor. An extension of the
Bayes-factor-driven transition can be introduced. Assume that, at
each time $t$, the effect of the transition can be represented by a
subset $\lambda_t$ of the parameters of the predictive distribution
that, for simplicity, can be denoted as
$p_{\lambda_t}(y_t \mid Y^o_{t-1}) = p(y_t \mid Y^o_{t-1}, \lambda_t)$.
Moreover, assume that if $\lambda_t = 0$ the transition has no effect.
Then, the Bayes factor

$$\eta_t(\lambda_t) = \frac{p_{\lambda_t}(y^o_t \mid Y^o_{t-1})}{p_0(y^o_t \mid Y^o_{t-1})}$$

can be considered as a function of $\lambda_t$. At each time, an
optimal value of $\lambda_t$ may be defined as

$$\lambda_t = \arg\max_{\lambda_t} \eta_t(\lambda_t)$$

under the constraint $0 < \lambda_t < \lambda^+$, to prevent the model
from following outliers. As for the Bayes-factor-driven transition,
in many parametric cases $\lambda_t$ has a simple analytic form and
the computation is very easy.

These ideas can be combined as follows in a solution which both
considers the many-step-ahead prediction errors and chooses an
optimal effect of the transition. At each time $t$, the transition
is driven by a function $h_t(\cdot)$ of many-step-ahead prediction
errors (as defined before), and by a suitable threshold $r$ for this
function. If $h_t(\cdot) > r$, then the transition happens and an
optimal value of $\lambda_t$ can be evaluated by defining a Bayes
factor as follows:



$$\eta_t\left(\lambda_t, \min(s, r_t)\right) = \frac{p_{\lambda_t}\left(y^o_t, y^o_{t-1}, \ldots, y^o_{t-\min(s, r_t)} \mid Y^o_{t-\min(s, r_t)}\right)}{p_0\left(y^o_t, y^o_{t-1}, \ldots, y^o_{t-\min(s, r_t)} \mid Y^o_{t-\min(s, r_t)}\right)} \tag{7}$$

and getting $\lambda_t = \arg\max_{\lambda_t} \eta_t(\lambda_t, \min(s, r_t))$.
Following this solution, the decision about the transition is a
function of the prediction errors, while the amount of the
transition (in terms of variance increase) is determined via the
predictive Bayes factor computed for all the predictive
distributions of the current model.
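
To illustrate why the optimal transition effect can have a simple analytic form, the Python sketch below treats the one-step-ahead Bayes factor in a scalar normal case. It assumes a normal predictive distribution with mean f and variance F^2 (1 + lambda) C + V, where C is the posterior variance of the parameter and V the observation variance. This parametrisation of the transition effect, the closed interval used for the constraint, and all names are assumptions for illustration only; the multi-step factor of equation (7) is not implemented.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def bayes_factor(y, f, C, V, lam, F=1.0):
    """eta(lambda) = p_lambda(y) / p_0(y) for the assumed scalar model."""
    q_lam = F * F * (1.0 + lam) * C + V   # predictive variance with transition
    q_0 = F * F * C + V                   # predictive variance without it
    return normal_pdf(y, f, q_lam) / normal_pdf(y, f, q_0)

def optimal_lambda(y, f, C, V, lam_max, F=1.0):
    """Maximiser of eta over [0, lam_max].

    A normal density evaluated at y is largest, as a function of its
    variance, when the variance equals the squared error (y - f)^2.
    Solving for lambda and clipping to the allowed range gives the result.
    """
    lam_free = ((y - f) ** 2 - V) / (F * F * C) - 1.0
    return min(max(lam_free, 0.0), lam_max)
```

In this sketch a small prediction error gives lambda equal to zero (no transition effect), while a very large error is capped at lam_max, which is what keeps the filter from following outliers.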

6 On Line Forecasting of Nitrogen Dioxide 
Concentrations 

In order to illustrate the proposed solutions, we consider a set of
data from the air pollution monitoring network of Venezia Mestre.
These data are collected in real time, and in this context a
statistical model is required to give many-step-ahead forecasts in a
very short time. In fact, if the concentration of a pollutant
becomes higher than a warning threshold, public authorities must
implement public interventions (such as traffic and house heating
restrictions) in order to reduce the concentration. The subset
of data considered in this paper concerns hourly measures of NO$_2$
at one station of the network, in the period between 01.01.97 and
08.02.97 (936 observations). The following regressors were
considered for the prediction of NO$_2$ at time $t$: a weekly periodic
component estimated from traffic data, the thermal gradient at time
$t - 8$, and the wind speed at time $t - 1$. One-hour-ahead and
24-hours-ahead predictions (expected values of the predictive
distributions) have been obtained with different transition
frequencies and with different solutions for the amount of the
transition effect. The predictive performance has been evaluated by
means of the squared correlation coefficient between observed and
predicted values (both for 1-hour-ahead and 24-hours-ahead
forecasting). The number of transitions that occurred has also been
considered.

First, we considered the model with fixed-lag transition (every
$k$ time points) and a fixed variance matrix $W_t = W = 0.1 I$. Three
values of $k$ were considered: $k = 1$ (transition every hour, the
usual Kalman filter solution), $k = 12$ (transition at 7 a.m. and at
7 p.m. every day), and $k = 24$ (transition at 3 a.m. every day). The
results, reported in Table 1, show that a different transition lag
may improve the predictive performance of the model. The transition
lag could have the same importance and the same effect on the
predictions as the system error variance.

Table 1. Predictive performances of the fixed-lag transition model, for different
values of k

Frequency of transition    Fixed lag    Fixed lag    Fixed lag
                           k = 1        k = 12       k = 24
Effect of transition       Fixed        Fixed        Fixed
                           W = 0.1 I    W = 0.1 I    W = 0.1 I
No. of transitions         936          78           39
24-hrs-ahead pred.         0.778        0.783        0.762
1-hr-ahead pred.           0.784        0.802        0.799

Table 2. Predictive performances of the prediction-error-driven transition model,
for different weight functions

Frequency of transition    Prediction errors    Prediction errors
                           (a)                  (b)
Effect of transition       Fixed                Fixed
                           W = 0.1 I            W = 0.1 I
No. of transitions         64                   119
24-hrs-ahead pred.         0.777                0.775
1-hr-ahead pred.           0.830                0.870
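
The performance measure used in this section, the squared correlation coefficient between observed and predicted values, can be computed as in the following minimal Python sketch. The function name is hypothetical, and the alignment of each prediction with the observation it refers to is assumed to have been done beforehand.

```python
import numpy as np

def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    r = np.corrcoef(np.asarray(observed, dtype=float),
                    np.asarray(predicted, dtype=float))[0, 1]
    return float(r ** 2)
```

Note that this measure is unchanged by a linear rescaling of the predictions, so it rewards predictions that track the observed pattern but does not penalise a constant bias.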

Then, we considered a model with a fixed variance matrix
$W_t = W = 0.1 I$, but with prediction-error-driven transition. We
consider the function defined by equation (6) with two different
choices for the weights $\pi_{t,j}$:

(a) at each time $t$, $\pi_{t,j}$ uniform, for all $j$;

(b) at each time $t$:

$$\pi_{t,j} = \begin{cases} 1 & \text{if } j = \min(s, r_t) \\ 0 & \text{otherwise.} \end{cases} \tag{8}$$

In the second case, we are particularly interested in the
24-hours-ahead predictions. In both cases the threshold was set
equal to 30. Results are reported in Table 2. Here the predictive
performances are better than in the cases with fixed-lag transition.
Note that the number of transitions that occurred is not high,
compared with the number of observations considered.

Models with Bayes-factor-driven transition have been considered
both with a fixed variance matrix $W_t = W = 0.1 I$ and with $W_t$ set
following the discount approach, with $\lambda = 0.05$ (corresponding
to $\delta \approx 0.95$) and with $\lambda = 0.25$ (corresponding to
$\delta \approx 0.80$). Results are reported in Table 3.

Table 3. Predictive performances of the Bayes-factor driven transition model, for
both fixed and variable effects of transition

Frequency of transition    Bayes Factor    Bayes Factor    Bayes Factor
                           k = 1           k = 12          k = 24
Effect of transition       Fixed           Discount        Discount
                           W = 0.1 I       λ = 0.05        λ = 0.25
No. of transitions         66              80              135
24-hrs-ahead pred.         0.761           0.780           0.782
1-hr-ahead pred.           0.826           0.813           0.833

Finally, we considered the model with prediction-error-driven
transition and optimal transition effect via the Bayes factor, again
with the function defined by equation (6) and with the two
different choices for the weights $\pi_{t,j}$ described before. The
results, reported in Table 4, appear to be better than in the other
cases, particularly for 24-hours-ahead predictions.

Table 4. Predictive performances of the optimal discount transition model, for
different weight functions

Frequency of transition    Prediction errors    Prediction errors
                           (a)                  (b)
Effect of transition       Bayes Factor         Bayes Factor
No. of transitions         67                   106
24-hrs-ahead pred.         0.795                0.790
1-hr-ahead pred.           0.866                0.858

From these results it is quite clear that the predictive
performance of the model can be improved by an accurate setting of
both the frequency and the effect of the transition equation. The
improvement is particularly noticeable for 24-hours-ahead
prediction. Moreover, we notice that both the prediction-error-driven
and the Bayes-factor-driven transition models require a reduced
number of transitions.

References 

CARLIN, B., POLSON, N., and STOFFER, D. (1992): A Monte Carlo approach 
to nonnormal and nonlinear state-space modeling. Journal of the American 
Statistical Association, 87, 493-500. 

CARTER, C.K. and KOHN, R. (1994): On Gibbs sampling for state-space models. 
Biometrika, 81, 541-553. 

DOUCET, A., DE FREITAS, N., and GORDON, N. (Eds.) (2001): Sequential
Monte Carlo Methods in Practice. Springer, New York.

FRÜHWIRTH-SCHNATTER, S. (1994): Data augmentation and dynamic linear
models. Journal of Time Series Analysis, 15, 183-202.

HARRISON, P.J. and STEVENS, C. (1976): Bayesian Forecasting (with
discussion). Journal of the Royal Statistical Society (Ser. B), 38, 205-247.

HASTIE, T. and TIBSHIRANI, R. (1993): Varying coefficient model. Journal of the
Royal Statistical Society (Ser. B), 55, 757-796.

HUANG, H-C and CRESSIE, N. (1996): Spatio temporal prediction of snow water 
equivalent using the Kalman filter. Computational Statistics and Data Analysis, 
22, 159-175. 

KALMAN, R.E. (1960): A new approach to linear filtering and prediction problems.
Journal of Basic Engineering, 82D, 33-45.

MANTOVAN, P. and PASTORE, A. (1999): Marginal updating equation for
measurement error variance matrix and state vector estimates in the dynamic linear
model. In: H. Bacelar-Nicolau et al. (Eds.): Applied Stochastic Models and Data
Analysis. Proceedings of the IX International Symposium, Instituto Nacional de
Estatística, Portugal, 212-217.

MANTOVAN, P., PASTORE, A., and TONELLATO, S. (2000): A comparison
between parallel algorithms for system parameter identification in dynamic linear
models. Appl. Stochastic Models Bus. Ind., 15, 369-378.

MC WHORTER, A., SPIVEY, W.A., and WROBLESKI, W.J. (1976): A sensitiv- 
ity analysis of varying parameter econometric models. International Statistical 
Review, 44, 265-282. 

RUUD, P.A. (1991): Extension of estimation methods using the EM algorithm. 
Journal of Econometrics, 45, 305-341. 







SHUMWAY, R.H. and STOFFER, D.S. (1982): An approach to time series
smoothing and forecasting using the EM-algorithm. Journal of Time Series Analysis,
3, 253-264.

WATSON, M.W. and ENGLE, R.F. (1983): Alternative algorithms for the
estimation of dynamic factor, MIMIC and varying coefficient regression models.
Journal of Econometrics, 23, 385-400.

WEST, M. (1986): Bayesian model monitoring. Journal of the Royal Statistical
Society (Ser. B), 48, 70-78.

WEST, M. and HARRISON, P.J. (1997): Bayesian Forecasting and Dynamic
Models (2nd ed.). Springer, New York.




Author Index 




A

Agati P. 195
Amendola A. 147
Amenta P. 159
Anitori P. 169

B

Baragona R. 133
Bilancia M. 209
Bologna S. 219

C

Cal D. G. 119
Cappelli C. 3
Chiogna M. 233
Conversano C. 93

D

De Cantis S. 181
de Carvalho F. de A. T. 15
De Gregorio C. 169
de Souza R. M. C. R. 15
Di Battista T. 245
Di Giacinto V. 53

F

Fontanella L. 65

G

Gaetan C. 233
Gattone S. A. 245
Giordano F. 107
Granturco M. 65

I

Ippoliti L. 53, 255

L

La Rocca M. 107
Lovison G. 219

M

Mantovan P. 265
Mendola D. 181
Miglio R. 25
Mola F. 3

N

Niglio M. 79



P 

Pagnotta S. M. 79 
Pastore A. 265 
Perna C. 107 
Piccarreta R. 37 
Pillati M. 119 
Pollice A. 209 



R 

Romagnoli L. 53, 255



S

Sarnacchiaro P. 159 
Soffritti G. 25 
Storti G. 147 



V 

Verde R. 15 
Vitrano S. 133 




Subject Index 



A 

Abundance 248 

Adaptive Cluster Sampling 246 
Adaptive Sampling 245, 247 
Adjusted Least Square 255, 258
Adjusted Likelihood 54
AIC 11, 94
Air Pollution 233, 265
Analysis of Poverty 181
ARMA(p,q) 80 
Asymmetric Specification 219 
Asymmetry 148 
Axiomatic Procedures 195 



B 

Backfitting Algorithm 100 
Bayesian Combinations 195 
Bayesian Hierarchical 
Modelling 209, 210 
Biological Population 245 
Bootstrap 96, 107, 110 
Business Surveys 171 



C

CART 5, 37, 38
Chemical-Physical Process 265
Classification Trees 3, 25, 37, 38
Classifier 23
Conditional Autoregressive
(CAR) 210
Confusion Tables 44 
Crossover 136 

Customer Satisfaction 159, 163 



D 

Data Mining 16, 116 
Disease Mapping 209 
Dissimilarity 18, 33, 122 
Diversity 246, 249 
Dynamic Generalized Linear 
Models (DGLM) 235, 236 
Dynamic Regression 265, 266 



E 

Economic Time Series 147 
Environmental Epidemiology 
233 

Environmental Statistics 233 
Environmental Studies 248 
Equitability 249 
Euclidean Distance 125 
Exponential Power Function 
133 



F 

Fitness Function 135 
Fourier Analysis 66 







Fuzzy Dynamic 181 
Fuzzy Sets 188, 190 

G 

Generalized Additive Models 
(GAM) 93, 234 
Generalized Gamma 

Distribution 219, 221 
Generalized Linear Model 234 
Generalized Linear Mixed 
Model (GLMM) 209 
Genetic Algorithm 133, 134 
Gini-Simpson Criterion 48 

H 

Heteroscedasticity 79, 89, 147 
Hyperparameters 212, 267 

I 

Image Analysis 261 

Industrial Production Index
147

Interaction 219 
Inversion 136 

K 

Kalman Filter 56, 268 
K-means 126 



L 

Lag 55 

Linear Model 255 

Locally Weighted Polynomials
93

Lognormal Model 249 



M 

Manhattan Distance 125 
Maximum Likelihood (ML) 257 
Maximum Pseudo-Likelihood 
(MPL) 257 
MCMC 212 

Meteorological Variables 239 
Moment Interaction 220 
Mortality 233 

Multidimensional Scaling 187 
Multi-Modules 163 
Mutation 136 

Mutual Neighbourhood Graph 
16 



N 

Neural Network 107, 119 
Neural Network Estimates 112 
Noisy Images 255, 263 
Non Linearity 147 
Normal Distributions of Order 
p 133 







P 

Partitioning Around Medoids 
(PAM) 123 

Performance Indicators 169 
Periodogram 70
Poverty 182, 186 
Principal Component Analysis 
169, 187 

Product Ratios (RCPR) 222 
Proximity Measures 25
Pruning 31, 109 



R 

Radial Basis Functions 119 
Random Fields 256 
Real-Time Forecasting 265 
Reproduction 136 
Robust Centre Location 119 



S

Seasonal Components 70, 80 
Seasonal Fluctuations 241 
Second Order Asymmetric 
Interaction (SOAI) 225 
Second-order Interaction 219 
Sequential Consulting 195 
Serial Correlation 233 
Signal Extraction 65, 255, 256 
Signal-to-Noise Ratio 60
Similarity 27 
Smoothing 93, 96 



Smoothing Splines 93 
Smoothness 95 
Spatial Clustering 209 
Spatial Contiguity Matrix 55 
Spatial Median 119, 121
Spatio-Temporal Series 53 
Spectral Analysis 65 
Splitting Variables 8 
Stationarity 79 
Stationary Signal Process 68 
Statistical Ecology 245 
Stochastic Process 65
Stopping Rules 195 
STP 11 

Symbolic Data 16 

T 

Threshold Autoregressive 
(TAR) 148 
Time Series 79, 108 
Total Quality 159 



V 

Variables Selection 110 
Volatility 148 

W

Wavelet 72 
Wholesale 169 




